Abstract: We establish rates of convergence in time series forecasting using
the statistical learning approach based on oracle inequalities. A series of
papers (e.g. [MM98, Mei00, BCV01, AW12]) extends the oracle inequalities
obtained for iid observations to time series under weak dependence conditions.
Given a family of predictors and $n$ observations, oracle inequalities state
that a predictor forecasts the series as well as the best predictor in the
family up to a remainder term $\Delta_n$. Using the PAC-Bayesian approach, we
establish under weak dependence conditions oracle inequalities with optimal
rates of convergence $\Delta_n$. We extend results given in [AW12] for the
absolute loss function to any Lipschitz loss function with rates
$\Delta_n \sim \sqrt{c(\Theta)/n}$, where $c(\Theta)$ measures the complexity of
the model. We apply the method for quantile loss functions to forecast the
French GDP. Under additional conditions on the loss functions (satisfied by the
quadratic loss function) and on the time series, we refine the rates of
convergence to $\Delta_n \sim c(\Theta)/n$. We achieve for the first time these
fast rates for uniformly mixing processes. These rates are known to be optimal
in the iid case, see [Tsy03], and for individual sequences, see [CBL06]. In
particular, we generalize the results of [DT08] on sparse regression estimation
to the case of autoregression.

1. Introduction

Time series forecasting is a fundamental subject in the mathematical statistics
literature. The parametric approach contains a wide range of models associated
with efficient estimation and prediction methods, see e.g. [Ham94]. Classical
parametric models include linear processes such as ARMA models [BD09]. More
recently, non-linear processes such as stochastic volatility and ARCH models
received a lot of attention in financial applications; see, e.g., the seminal
paper by Nobel prize winner [Eng82], and [FZ10] for a more recent introduction.
However, parametric assumptions rarely hold on data. Assuming that the data
satisfy a model can bias the prediction and lead to underestimating the risks,
see among others the polemical but highly informative discussion in [Tal07].

In the last few years, several universal approaches emerged from various fields
such as non-parametric statistics, machine learning, computer science and game
theory. These approaches share some common features: the aim is to build a
procedure that predicts the time series as well as the best predictor in a given
set of initial predictors $\Theta$, without any parametric assumption on the
distribution of the observed time series. However, the set of predictors can be
inspired by different parametric or non-parametric statistical models. We can
distinguish two classes in these approaches, with different quantification of
the objective, and different terminologies:

• in the "prediction of individual sequences" approach, predictors are usually
called "experts". The objective is online prediction: at each date $t$, a
prediction of the future realization $x_{t+1}$ is based on the previous
observations $x_1, \dots, x_t$, the objective being to minimize the cumulative
prediction loss. See for example [CBL06, Sto10] for an introduction.

• in the statistical learning approach, the given predictors are sometimes
referred to as "models" or "concepts". The batch setting is more classical in
this approach. A prediction procedure is built on a complete sample
$X_1, \dots, X_n$. The performance of the procedure is compared on the expected
loss, called the risk, with the best predictor, called the "oracle". The
environment is not deterministic and some hypotheses like mixing or weak
dependence are required: see [Mei00, MM98, AW12].

In both settings, one is usually able to predict a time series as well as the
best model or expert, up to an error term that decreases with the number of
observations $n$. This type of result is referred to in statistical theory as
an oracle inequality. In other words, one builds on the basis of the
observations a predictor $\hat\theta$ such that

$$R(\hat\theta) \le \inf_{\theta \in \Theta} R(\theta) + \Delta(n, \Theta) \qquad (1.1)$$
where $R(\theta)$ is a measure of the prediction risk of the predictor
$\theta \in \Theta$. In general, the remainder term is of the order
$\Delta(n, \Theta) \sim \sqrt{c(\Theta)/n}$ in both approaches, where
$c(\Theta)$ measures the complexity of $\Theta$. See, e.g., [CBL06] for the
"individual sequences" approach; for the "statistical learning approach" the
rate $\sqrt{c(\Theta)/n}$ is reached in [AW12] with the absolute loss function
and under a weak dependence assumption. Different procedures are used to reach
these rates. Let us mention the empirical risk minimization [Vap99] and
aggregation procedures with exponential weights, usually referred to as EWA
[DT08, Ger11] or Gibbs estimator [Cat04, Cat07] in the batch approach, linked to
the weighted majority algorithm of the online approach [LW94], see also [Vov90].
Note that results from the "individual sequences" approach can sometimes be
extended to the batch setting, see e.g. [Ger11] for the iid case, and
[AD11, DAJJ12] for mixing time series.

In this paper, we extend the results of [AW12] to the case of a general loss
function. Another improvement with respect to [AW12] is to study both the ERM
and the Gibbs estimator under various hypotheses. We achieve here inequalities
of the form of (1.1) that hold with large probability ($1 - \varepsilon$ for any
arbitrarily small confidence level $\varepsilon > 0$) with
$\Delta(n, \Theta) \sim \sqrt{c(\Theta)/n}$. To do so, we assume that the
observations are taken from a bounded stationary process $(X_t)$ (see [AW12]
however for some possible extensions to unbounded observations). We also assume
weak dependence conditions on the process $(X_t)$. Then we prove that the fast
rate $\Delta(n, \Theta) \sim c(\Theta)/n$ can be reached for some loss functions
including the quadratic loss. Note that [Mei00, MM98] deal with the quadratic
loss; their rate can be better than $\sqrt{c(\Theta)/n}$ but cannot reach
$c(\Theta)/n$.

Our main results are based on PAC-Bayesian oracle inequalities. The
PAC-Bayesian point of view emerged in statistical learning in supervised
classification using the 0/1-loss, see the seminal papers [STW97, McA99]. These
results were then extended to general loss functions and more accurate bounds
were given, see for example [Cat04, Cat07, Alq08, Aud10, AL11, SLCB+12, DS12].
In PAC-Bayesian inequalities the complexity term $c(\Theta)$ is defined through
a prior distribution on the set $\Theta$.

The paper is organized as follows: Section 2 provides notations used in the
whole paper. We give a definition of the Gibbs estimator and of the ERM in
Section 3. The main hypotheses necessary to prove theoretical results on these
estimators are provided in Section 4. We give examples of inequalities of the
form (1.1) for classical sets of predictors in Section 5. When possible, we also
prove some results on the ERM in these settings. These results only require a
general weak-dependence type assumption on the time series to forecast. We
then study fast rates under the stronger $\phi$-mixing assumption of [Ibr62] in
Section 6. Note that the $\phi$-mixing setting coincides with the one of
[AD11, DAJJ12] when $(X_t)$ is stationary. In particular, we are able to
generalize the results of [DT08, Ger11, AL11] on sparse regression estimation to
the case of autoregression. In Section 7 we provide an application to French GDP
forecasting. A short simulation study is provided in Section 8. Finally, the
proofs of all the theorems are given in Appendices A and B.

2. Notations

Let $X_1, \dots, X_n$ denote the observations at time $t \in \{1, \dots, n\}$ of
a time series $X = (X_t)_{t \in \mathbb{Z}}$ defined on
$(\Omega, \mathcal{A}, \mathbb{P})$. We assume that this series is stationary
and takes values in $\mathbb{R}^p$ equipped with the Euclidean norm
$\|\cdot\|$. We fix an integer $k$, that might depend on $n$, $k = k(n)$, and
assume that a family of predictors is available:
$\{f_\theta : (\mathbb{R}^p)^k \to \mathbb{R}^p, \theta \in \Theta\}$. For any
parameter $\theta$ and any time $t$, $f_\theta(X_{t-1}, \dots, X_{t-k})$ is the
prediction of $X_t$ returned by the predictor $\theta$ when given
$(X_{t-1}, \dots, X_{t-k})$. For the sake of brevity, we use the notation:

$$\hat X_t^\theta = f_\theta(X_{t-1}, \dots, X_{t-k}).$$

We assume that $\theta \mapsto f_\theta$ is a linear function. Let us fix a loss
function $\ell$ that measures a distance between the forecast and the actual
realization of the series. Assumptions on $\ell$ will be given in Section 4.

Definition 1. For any $\theta \in \Theta$ we define the prediction risk as

$$R(\theta) = \mathbb{E}\left[ \ell\left( \hat X_t^\theta, X_t \right) \right]$$

($R(\theta)$ does not depend on $t$ thanks to the stationarity assumption).

Using the statistics terminology, note that we may want to include parametric
sets of predictors as well as non-parametric ones (i.e. respectively finite
dimensional and infinite dimensional $\Theta$). Let us mention classical
parametric and non-parametric families of predictors:

Example 1. Define the set of linear autoregressive predictors as

$$f_\theta(X_{t-1}, \dots, X_{t-k}) = \theta_0 + \sum_{j=1}^{k} \theta_j X_{t-j}$$

for $\theta = (\theta_0, \theta_1, \dots, \theta_k) \in \Theta \subset \mathbb{R}^{k+1}$.

In order to deal with non-parametric settings, we will also use a
model-selection type notation: $\Theta = \cup_{j=1}^{M} \Theta_j$.

Example 2. Consider non-parametric auto-regressive predictors

$$f_\theta(X_{t-1}, \dots, X_{t-k}) = \sum_{i=1}^{j} \theta_i \varphi_i(X_{t-1}, \dots, X_{t-k})$$

where $\theta = (\theta_1, \dots, \theta_j) \in \Theta_j \subset \mathbb{R}^j$
and $(\varphi_i)_{i=1}^{\infty}$ is a dictionary of functions
$(\mathbb{R}^p)^k \to \mathbb{R}^p$ (e.g. Fourier basis, wavelets, splines...).

3. ERM and Gibbs estimator 
3.1. The estimators 

As the objective is to minimize the risk $R(\cdot)$, we use the empirical risk
$r_n(\cdot)$ as an estimator of $R(\cdot)$.

Definition 2. For any $\theta \in \Theta$,
$r_n(\theta) = \frac{1}{n-k} \sum_{i=k+1}^{n} \ell\left( \hat X_i^\theta, X_i \right)$.

Definition 3 (ERM estimator [Vap99]). We define the Empirical Risk Minimizer
estimator (ERM) by

$$\hat\theta^{ERM} \in \arg\min_{\theta \in \Theta} r_n(\theta).$$



Let $\mathcal{T}$ be a $\sigma$-algebra on $\Theta$ and
$\mathcal{M}_+^1(\Theta)$ denote the set of all probability measures on
$(\Theta, \mathcal{T})$. The Gibbs estimator depends on a fixed probability
measure $\pi \in \mathcal{M}_+^1(\Theta)$ called the prior that will be involved
when measuring the complexity of $\Theta$.

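Definition 4 (Gibbs estimator or EWA). Define the Gibbs estimator with inverse
temperature $\lambda > 0$ as

$$\hat\theta_\lambda = \int \theta \, \hat\rho_\lambda(d\theta), \quad \text{where} \quad \hat\rho_\lambda(d\theta) = \frac{e^{-\lambda r_n(\theta)} \pi(d\theta)}{\int e^{-\lambda r_n(\theta')} \pi(d\theta')}.$$

The choice of $\pi$ and $\lambda$ in practice is discussed in Section 5.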
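
To make Definitions 2 to 4 concrete, here is a minimal Python sketch, not taken
from the paper, that computes the empirical risk, the ERM and the Gibbs
estimator for AR(1) predictors $f_\theta(x) = \theta x$ under the absolute
loss. The finite grid of parameters, the uniform prior on that grid, the
simulated series and all function names are illustrative assumptions.

```python
import numpy as np

def empirical_risk(x, theta):
    """r_n(theta) for the AR(1) predictor theta * X_{t-1} (k = 1), absolute loss."""
    preds = theta * x[:-1]               # predictions of X_2, ..., X_n
    return np.mean(np.abs(preds - x[1:]))  # average over the n - k available terms

def erm_and_gibbs(x, grid, lam):
    """Return (ERM, Gibbs estimator, Gibbs weights) on a finite grid, uniform prior."""
    risks = np.array([empirical_risk(x, th) for th in grid])
    erm = grid[np.argmin(risks)]
    # Gibbs weights are proportional to exp(-lam * r_n(theta)) * prior.
    # Subtracting the minimal risk avoids underflow and cancels after normalisation.
    w = np.exp(-lam * (risks - risks.min()))
    w /= w.sum()
    gibbs = float(np.sum(w * grid))      # mean of theta under the Gibbs distribution
    return erm, gibbs, w

# Hypothetical bounded AR(1) series with uniform innovations (illustration only).
rng = np.random.default_rng(0)
n = 2000
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + rng.uniform(-1.0, 1.0)

grid = np.linspace(-1.0, 1.0, 41)
lam = np.sqrt(n * np.log(len(grid)))     # order sqrt(n log M), one possible choice
theta_erm, theta_gibbs, weights = erm_and_gibbs(x, grid, lam)
```

With `lam = 0` the Gibbs estimator reduces to the prior mean, while a very large
`lam` concentrates the weights on the ERM.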



3.2. Overview of the results 

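Our results assert that the risk of the ERM or Gibbs estimator is close to
$\inf_\theta R(\theta)$ up to a remainder term $\Delta(n, \Theta)$ called the
rate of convergence. For the sake of simplicity, let $\bar\theta \in \Theta$ be
such that

$$R(\bar\theta) = \inf_{\theta} R(\theta).$$

If $\bar\theta$ does not exist, it is replaced by an approximate minimizer
$\bar\theta_\alpha$ satisfying
$R(\bar\theta_\alpha) \le \inf_\theta R(\theta) + \alpha$ where $\alpha$ is
negligible w.r.t. $\Delta(n, \Theta)$ (e.g. $\alpha < 1/n^2$). We want to prove
that the ERM satisfies, for any $\varepsilon > 0$,

$$\mathbb{P}\left( R\left(\hat\theta^{ERM}\right) \le R(\bar\theta) + \Delta(n, \Theta, \varepsilon) \right) \ge 1 - \varepsilon \qquad (3.1)$$

where $\Delta(n, \Theta, \varepsilon) \to 0$ as $n \to \infty$. We also want to
prove that the Gibbs estimator satisfies, for any $\varepsilon > 0$,

$$\mathbb{P}\left( R\left(\hat\theta_\lambda\right) \le R(\bar\theta) + \Delta(n, \lambda, \pi, \varepsilon) \right) \ge 1 - \varepsilon \qquad (3.2)$$

where $\Delta(n, \lambda, \pi, \varepsilon) \to 0$ as $n \to \infty$ for some
$\lambda = \lambda(n)$. To obtain such results, called oracle inequalities, we
require some assumptions discussed in the next section.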

4. Main assumptions

We prove oracle inequalities under assumptions of two different types. On the
one hand, assumptions LipLoss(K) and Lip(L) bear respectively on the loss
function $\ell$ and the set of predictors $\Theta$. To some extent, we choose
the loss function and the predictors, so these assumptions can always be
satisfied. Assumption Margin($\mathcal{K}$) also bears on $\ell$.

On the other hand, assumptions Bound(B), WeakDep(C) and PhiMix(C) bear on the
dependence and boundedness of the time series. In practice, we cannot know
whether these assumptions are satisfied on data. However, note that these
assumptions are not parametric and are satisfied for many classical models,
see [Dou94, DDL+07].

Assumption LipLoss(K), $K > 0$: the loss function $\ell$ is given by
$\ell(x, x') = g(x - x')$ for some convex $K$-Lipschitz function $g$ such that
$g(0) = 0$ and $g \ge 0$.

Example 3. A classical example in statistics is given by
$\ell(x, x') = \|x - x'\|$, see [AW12]. It satisfies LipLoss(K) with $K = 1$. In
[MM98, Mei00], the loss function used is the quadratic loss
$\ell(x, x') = \|x - x'\|^2$. It satisfies LipLoss(4B) for time series bounded
by a constant $B > 0$.

Example 4. The class of quantile loss functions introduced in [KB78] is given
by

$$\ell_\tau(x, y) = \begin{cases} \tau (x - y), & \text{if } x - y > 0, \\ -(1 - \tau)(x - y), & \text{otherwise,} \end{cases}$$

where $\tau \in (0, 1)$ and $x, y \in \mathbb{R}$. The risk minimizer of
$t \mapsto \mathbb{E}(\ell_\tau(V - t))$ is the quantile of order $\tau$ of the
random variable $V$. Choosing this loss function one can deal with rare events
and build confidence intervals, see [Koe05, BC11, BP11]. In this case,
LipLoss(K) is satisfied with $K = \max(\tau, 1 - \tau) \le 1$.
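
The two properties stated in Example 4 can be checked numerically. The sketch
below is an illustration under assumptions of ours (a Gaussian sample, the level
$\tau = 0.9$, hypothetical function names): it verifies that the empirical
minimizer of the average quantile loss is the empirical quantile of order
$\tau$, and that the loss is Lipschitz with constant $\max(\tau, 1 - \tau)$.

```python
import numpy as np

def quantile_loss(x, y, tau):
    """Quantile (pinball) loss: tau*(x-y) if x-y > 0, -(1-tau)*(x-y) otherwise."""
    u = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.where(u > 0, tau * u, -(1.0 - tau) * u)

rng = np.random.default_rng(0)
v = rng.normal(size=1001)        # synthetic sample standing in for V
tau = 0.9

# The average loss t -> mean(l_tau(V - t)) is piecewise linear and convex,
# so its minimum is attained at one of the sample points.
candidates = np.sort(v)
avg_loss = np.array([quantile_loss(v, t, tau).mean() for t in candidates])
t_star = candidates[np.argmin(avg_loss)]

# Lipschitz constant of u -> l_tau(u, 0).
K = max(tau, 1.0 - tau)
```

Here `t_star` coincides with `np.quantile(v, tau)` for this sample size.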

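Assumption Lip(L), $L > 0$: for any $\theta \in \Theta$ there are coefficients
$a_j(\theta)$ for $1 \le j \le k$ such that, for any $x_1, \dots, x_k$ and
$y_1, \dots, y_k$,

$$\left\| f_\theta(x_1, \dots, x_k) - f_\theta(y_1, \dots, y_k) \right\| \le \sum_{j=1}^{k} a_j(\theta) \|x_j - y_j\|, \quad \text{with} \quad \sum_{j=1}^{k} a_j(\theta) \le L.$$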

Assumption Bound(B), $B > 0$: we assume that $\|X_0\| \le B$ almost surely.

Note that under Assumptions LipLoss(K), Lip(L) and Bound(B), the empirical risk
is a bounded random variable. Such a condition is required in the approach of
individual sequences. We assume it here for simplicity but it is possible to
extend the slow rates oracle inequalities to unbounded cases, see [AW12].

Assumption WeakDep(C) is about the $\theta_{\infty,n}(1)$-weak dependence
coefficients of [Rio00, DDL+07].
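Definition 5. For any $k > 0$, define the $\theta_{\infty,k}(1)$-weak dependence
coefficients of a bounded stationary sequence $(X_t)$ by the relation

$$\theta_{\infty,k}(1) := \sup_{f \in \Lambda_1^k,\ 0 < j_1 < \dots < j_k} \left\| \mathbb{E}\left[ f(X_{j_1}, \dots, X_{j_k}) \mid X_t, t \le 0 \right] - \mathbb{E}\left[ f(X_{j_1}, \dots, X_{j_k}) \right] \right\|_\infty$$

where $\Lambda_1^k$ is the set of 1-Lipschitz functions of $k$ variables

$$\Lambda_1^k = \left\{ f : (\mathbb{R}^p)^k \to \mathbb{R},\ \frac{|f(u_1, \dots, u_k) - f(u_1', \dots, u_k')|}{\sum_{j=1}^{k} \|u_j - u_j'\|} \le 1 \right\}.$$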

The sequence $(\theta_{\infty,k}(1))_{k>0}$ is non-decreasing in $k$. The idea
is that as soon as $X_k$ behaves "almost independently" from
$X_0, X_{-1}, \dots$ then $\theta_{\infty,k}(1) - \theta_{\infty,k-1}(1)$
becomes negligible. Actually, it is known that for many classical models of
stationary time series, the sequence is upper bounded, see [DDL+07] for details.

Assumption WeakDep(C), $C > 0$: $\theta_{\infty,k}(1) \le C$ for any $k > 0$.

Example 5. Examples of processes satisfying WeakDep(C) are provided in
[AW12, DDL+07]. They include Bernoulli shifts
$X_t = H(\xi_t, \xi_{t-1}, \dots)$ where the $\xi_t$ are iid,
$\|\xi_0\| \le b$ and $H$ satisfies a Lipschitz condition:

$$\|H(v_1, v_2, \dots) - H(v_1', v_2', \dots)\| \le \sum_{j=1}^{\infty} a_j \|v_j - v_j'\| \quad \text{with} \quad \sum_{j=1}^{\infty} j a_j < \infty.$$

Then $(X_t)$ is bounded by $B = \|H(0, 0, \dots)\| + bC$ and satisfies
WeakDep(C) with $C = \sum_{j=1}^{\infty} j a_j$. In particular, solutions of
linear ARMA models with bounded innovations satisfy WeakDep(C).
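
As an illustration of Example 5, consider the causal AR(1) recursion
$X_t = a X_{t-1} + \xi_t$, which can be written as the Bernoulli shift
$H(v_1, v_2, \dots) = \sum_{j \ge 1} a^{j-1} v_j$ with Lipschitz coefficients
$a_j = |a|^{j-1}$. The sketch below computes $C = \sum_j j a_j$ and
$B = \|H(0, 0, \dots)\| + bC$ and checks the bound on a simulated path. The
values $a = 0.5$, $b = 1$, the truncation of the series and the uniform
innovations are assumptions made for the example.

```python
import numpy as np

a, b = 0.5, 1.0                      # hypothetical AR(1) coefficient and innovation bound
j = np.arange(1, 200)                # truncation of the infinite sums (tail is negligible)
coef = np.abs(a) ** (j - 1)          # Lipschitz coefficients a_j of H
C = float(np.sum(j * coef))          # C = sum_j j * a_j, equal to 1 / (1 - |a|)^2 here
B = 0.0 + b * C                      # H(0, 0, ...) = 0 because H is linear

# Simulated path started at 0 (not at stationarity, which is enough for the bound).
rng = np.random.default_rng(1)
xi = rng.uniform(-b, b, size=10_000)
x = np.zeros_like(xi)
for t in range(1, len(xi)):
    x[t] = a * x[t - 1] + xi[t]
```

For this linear process the sharper bound $b / (1 - |a|)$ also holds, so $B$ is
conservative.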

In order to prove the fast rates oracle inequalities, a more restrictive
dependence condition is assumed. It bears on the uniform mixing coefficients
introduced by [Ibr62].

Definition 6. The $\phi$-mixing coefficients of the stationary sequence $(X_t)$
with distribution $\mathbb{P}$ are defined as

$$\phi_r = \sup_{(A, B) \in \sigma(X_t, t \le 0) \times \sigma(X_t, t \ge r)} |\mathbb{P}(B \mid A) - \mathbb{P}(B)|.$$

Assumption PhiMix(C), $C > 0$: $1 + \sum_{r=1}^{\infty} \sqrt{\phi_r} \le C$.

This assumption appears to be more restrictive than WeakDep(C) for bounded
time series:

Proposition 1 ([Rio00]).

Bound(B) and PhiMix(C) $\Rightarrow$ Bound(B) and WeakDep(CB).

(This result is not stated in [Rio00] but it is a direct consequence of the last
inequality in the proof of Corollaire 1, p. 907 in [Rio00].)

Finally, for fast rates oracle inequalities, an additional assumption on the
loss function $\ell$ is required. In the iid case, such a condition is also
required. It is called the Margin assumption, e.g. in [MT99, Alq08], or the
Bernstein hypothesis, [Lec11].

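Assumption Margin($\mathcal{K}$), $\mathcal{K} > 0$:

$$\mathbb{E}\left\{ \left[ \ell\left( X_{q+1}, f_\theta(X_q, \dots, X_1) \right) - \ell\left( X_{q+1}, f_{\bar\theta}(X_q, \dots, X_1) \right) \right]^2 \right\} \le \mathcal{K} \left[ R(\theta) - R(\bar\theta) \right].$$

As assumptions Margin($\mathcal{K}$) and PhiMix(C) are used only to obtain fast
rates, we postpone examples to Section 6.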



5. Slow rates oracle inequalities 



In this section, we give oracle inequalities (3.1) and/or (3.2) with slow rates
of convergence $\Delta(n, \Theta) \sim \sqrt{c(\Theta)/n}$. The proofs of these
results are given in Section B. Note that the results concerning the Gibbs
estimator are actually corollaries of a general result, Theorem 9, stated in
Section A. We introduce the following notation for the sake of brevity.

Definition 7. When Assumptions Bound(B), LipLoss(K), Lip(L) and WeakDep(C) are
satisfied, we say that we are under the set of Assumptions SlowRates($\kappa$)
where $\kappa = K(1 + L)(B + C)/\sqrt{2}$.



5.1. Finite classes of predictors 



Consider first the toy example where $\Theta$ is finite with $|\Theta| = M$,
$M \ge 1$. In this case, the optimal rate in the iid case is known to be
$\sqrt{\log(M)/n}$, see e.g. [Vap99].

Theorem 1. Assume that $|\Theta| = M$ and that SlowRates($\kappa$) is satisfied
for $\kappa > 0$. Let $\pi$ be the uniform probability distribution on
$\Theta$. Then the oracle inequality (3.2) is satisfied for any $\lambda > 0$,
$\varepsilon > 0$ with

$$\Delta(n, \lambda, \pi, \varepsilon) = \frac{2\lambda\kappa^2}{n(1 - k/n)^2} + \frac{2\log(2M/\varepsilon)}{\lambda}.$$

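The choice of $\lambda$ in practice in this toy example is already not trivial.
Plugging $\lambda = \sqrt{\log(M) n}$ into the bound of Theorem 1 yields the
oracle inequality:

$$R(\hat\theta_\lambda) \le R(\bar\theta) + 2\sqrt{\frac{\log(M)}{n}} \left[ 1 + \left( \frac{\kappa}{1 - k/n} \right)^2 \right] + \frac{2\log(2/\varepsilon)}{\sqrt{n\log(M)}}.$$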
However, this choice is not optimal and one would like to choose $\lambda$ as
the minimizer of the upper bound

$$\frac{2\lambda\kappa^2}{n(1 - k/n)^2} + \frac{2\log(M)}{\lambda}.$$

However $\kappa = \kappa(K, L, B, C)$ and the constants $B$ and $C$ are usually
unknown. In this context we will prefer the ERM predictor, which performs as
well as the Gibbs estimator with optimal $\lambda$:

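Theorem 2. Assume that $|\Theta| = M$ and that SlowRates($\kappa$) is satisfied
for $\kappa > 0$. Then the oracle inequality (3.1) is satisfied for any
$\varepsilon > 0$ with

$$\Delta(n, \Theta, \varepsilon) = \inf_{\lambda > 0} \left[ \frac{2\lambda\kappa^2}{n(1 - k/n)^2} + \frac{2\log(2M/\varepsilon)}{\lambda} \right] = \frac{4\kappa}{1 - k/n} \sqrt{\frac{\log(2M/\varepsilon)}{n}}.$$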
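
The link between the Gibbs bound of Theorem 1 and the ERM bound of Theorem 2
can be checked numerically. The sketch below assumes the remainder of Theorem 1
reads $2\lambda\kappa^2 / (n(1 - k/n)^2) + 2\log(2M/\varepsilon)/\lambda$ and
that of Theorem 2 reads
$4\kappa(1 - k/n)^{-1}\sqrt{\log(2M/\varepsilon)/n}$; the numerical values of
the constants and the function names are hypothetical. A bound of the form
$a\lambda + b/\lambda$ is minimized at $\lambda = \sqrt{b/a}$.

```python
import math

def kappa(K, L, B, C):
    """Constant of Definition 7: K (1 + L)(B + C) / sqrt(2)."""
    return K * (1 + L) * (B + C) / math.sqrt(2)

def delta_gibbs_finite(lam, n, k, M, eps, kap):
    """Remainder of Theorem 1 for a finite class of size M and inverse temperature lam."""
    return 2 * lam * kap**2 / (n * (1 - k / n) ** 2) + 2 * math.log(2 * M / eps) / lam

def delta_erm_finite(n, k, M, eps, kap):
    """Closed-form remainder of Theorem 2 (infimum over lam of the Gibbs bound)."""
    return 4 * kap / (1 - k / n) * math.sqrt(math.log(2 * M / eps) / n)

def best_lambda(n, k, M, eps, kap):
    """Minimizer of a*lam + b/lam with a = 2 kap^2 / (n (1-k/n)^2), b = 2 log(2M/eps)."""
    return (1 - k / n) / kap * math.sqrt(n * math.log(2 * M / eps))

# Hypothetical constants, for illustration only.
kap = kappa(K=1.0, L=1.0, B=1.0, C=1.0)
n, k, M, eps = 1000, 2, 50, 0.05
lam_star = best_lambda(n, k, M, eps, kap)
gibbs_bound = delta_gibbs_finite(lam_star, n, k, M, eps, kap)
erm_bound = delta_erm_finite(n, k, M, eps, kap)
```

The optimal `lam_star` depends on `kap`, hence on the usually unknown constants
$B$ and $C$, which is why the ERM is attractive here.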



5.2. Linear autoregressive predictors 

We focus on the linear predictors given in Example 1. 

Theorem 3. Consider the linear autoregressive model of AR($k$) predictors

$$f_\theta(x_{t-1}, \dots, x_{t-k}) = \theta_0 + \sum_{j=1}^{k} \theta_j x_{t-j}$$

with $\theta \in \Theta = \{\theta \in \mathbb{R}^{k+1}, \|\theta\| \le L\}$
such that Lip(L) is satisfied. Assume that Assumptions Bound(B), LipLoss(K) and
WeakDep(C) are satisfied. Let $\pi$ be the uniform probability distribution on
the extended parameter set
$\{\theta \in \mathbb{R}^{k+1}, \|\theta\| \le L + 1\}$. Then the oracle
inequality (3.2) is satisfied for any $\lambda > 0$, $\varepsilon > 0$ with

$$\Delta(n, \lambda, \pi, \varepsilon) = \frac{2\lambda\kappa^2}{n(1 - k/n)^2} + 2\, \frac{(k+1)\log\left( \frac{(KB \vee K^2B^2)(L+1) e \lambda}{k+1} \right) + \log(2/\varepsilon)}{\lambda}.$$

In theory, $\lambda$ can be chosen of the order $\sqrt{(k+1)n}$ to achieve the
optimal rates $\sqrt{(k+1)/n}$ up to a logarithmic factor. But the choice of the
optimal $\lambda$ in practice is still a problem. The ERM predictor still
performs as well as the Gibbs predictor with optimal $\lambda$.

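Theorem 4. Under the assumptions of Theorem 3, the oracle inequality (3.1) is
satisfied for any $\varepsilon > 0$ with

$$\Delta(n, \Theta, \varepsilon) = \inf_{\lambda \ge 2KB/(k+1)} \left[ \frac{2\lambda\kappa^2}{n(1 - k/n)^2} + \frac{(k+1)\log\left( \frac{2eKB(L+1)\lambda}{k+1} \right) + 2\log(2/\varepsilon)}{\lambda} \right].$$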
The additional constraint on $\lambda$ does not depend on $n$. It is
restrictive only when $k + 1$, the complexity of the autoregressive model, is of
the same order as $n$. For $n$ sufficiently large and
$\lambda = ((1 - k/n)/\kappa)\sqrt{(k+1)n/2}$ satisfying the constraint
$\lambda \ge 2KB/(k+1)$ we obtain the oracle inequality

$$R(\hat\theta^{ERM}) \le R(\bar\theta) + \sqrt{\frac{2(k+1)}{n}}\, \frac{\kappa}{1 - k/n} \log\left( \frac{2e^2 KB(L+1)}{\kappa} \sqrt{\frac{n}{k+1}} \right) + \frac{2\sqrt{2}\,\kappa\log(2/\varepsilon)}{\sqrt{(k+1)n}\,(1 - k/n)}.$$

Theorems 3 and 4 are both direct consequences of the following results about
general classes of predictors.

5.3. General parametric classes of predictors 

We state a general result about finite-dimensional families of predictors. The
complexity $k + 1$ of the autoregressive model is replaced by a more general
measure of the dimension $d(\Theta, \pi)$. We also introduce some general
measure $D(\Theta, \pi)$ of the diameter that will, for most compact models, be
linked to the diameter of the model.

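Theorem 5. Assume that SlowRates($\kappa$) is satisfied and that there exist
$d = d(\Theta, \pi) > 0$ and $D = D(\Theta, \pi) > 0$ satisfying the relation

$$\forall \delta > 0, \quad \log \frac{1}{\int_{\theta \in \Theta} \mathbf{1}\{R(\theta) - R(\bar\theta) < \delta\}\, \pi(d\theta)} \le d \log\left( \frac{D}{\delta} \right).$$

Then the oracle inequality (3.2) is satisfied for any $\lambda > 0$,
$\varepsilon > 0$ with

$$\Delta(n, \lambda, \pi, \varepsilon) = \frac{2\lambda\kappa^2}{n(1 - k/n)^2} + 2\, \frac{d\log(D e \lambda/d) + \log(2/\varepsilon)}{\lambda}.$$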

A similar result holds for the ERM predictor under a more restrictive
assumption on the structure of $\Theta$, see Remark 1 below.

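Theorem 6. Assume that

1. $\Theta = \{\theta \in \mathbb{R}^d : \|\theta\|_1 \le D\}$,

2. $\|\hat X_1^{\theta_1} - \hat X_1^{\theta_2}\| \le \psi \|\theta_1 - \theta_2\|_1$
a.s. for some $\psi > 0$ and all $(\theta_1, \theta_2) \in \Theta^2$.

Assume also that Bound(B), LipLoss(K) and WeakDep(C) are satisfied and that
Lip(L) holds on the extended model
$\Theta' = \{\theta \in \mathbb{R}^d : \|\theta\|_1 \le D + 1\}$. Then the
oracle inequality (3.1) is satisfied for any $\varepsilon > 0$ with

$$\Delta(n, \Theta, \varepsilon) = \inf_{\lambda \ge 2K\psi/d} \left[ \frac{2\lambda\kappa^2}{n(1 - k/n)^2} + \frac{d\log\left( 2eK\psi(D+1)\lambda/d \right) + 2\log(2/\varepsilon)}{\lambda} \right].$$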
This result yields nearly optimal rates of convergence for the ERM predictors.
Indeed, for $n$ sufficiently large and
$\lambda = ((1 - k/n)/\kappa)\sqrt{dn/2} \ge 2K\psi/d$ we obtain the oracle
inequality

$$R(\hat\theta^{ERM}) \le R(\bar\theta) + \sqrt{\frac{2d}{n}}\, \frac{\kappa}{1 - k/n} \log\left( \frac{2e^2 K\psi(D+1)}{\kappa} \sqrt{\frac{n}{d}} \right) + \frac{2\sqrt{2}\,\kappa\log(2/\varepsilon)}{\sqrt{dn}\,(1 - k/n)}.$$

Thus, the ERM procedure yields predictions that are close to the oracle with an
optimal rate of convergence up to a logarithmic factor.

Example 6. Consider the linear autoregressive model of AR($k$) predictors
studied in Theorems 3 and 4. Then Lip(L) is automatically satisfied with
$L = D + 1$. The assumptions of Theorem 6 are satisfied with $d = k + 1$ and
$\psi = B$. Moreover, thanks to Remark 1, the assumptions of Theorem 5 are
satisfied with $D(\Theta, \pi) = (KB \vee K^2B^2)(L + 1)$. Then Theorems 3 and 4
are actually direct consequences of Theorems 5 and 6.

Note that the context of Theorem 6 is less general than the one of Theorem 5:

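Remark 1. Under the assumptions of Theorem 6 we have for any $\theta \in \Theta$

$$R(\theta) - R(\bar\theta) = \mathbb{E}\left\{ g\left( \hat X_1^{\theta} - X_1 \right) - g\left( \hat X_1^{\bar\theta} - X_1 \right) \right\} \le \mathbb{E}\left\{ K \left\| \hat X_1^{\theta} - \hat X_1^{\bar\theta} \right\| \right\} \le K\psi \|\theta - \bar\theta\|_1.$$

Define $\pi$ as the uniform distribution on
$\Theta' = \{\theta \in \mathbb{R}^d : \|\theta\|_1 \le D + 1\}$. We derive from
simple computation the inequality

$$\log \frac{1}{\int_{\theta \in \Theta'} \mathbf{1}\{R(\theta) - R(\bar\theta) < \delta\}\, \pi(d\theta)} \le \log \frac{1}{\int_{\theta \in \Theta'} \mathbf{1}\{\|\theta - \bar\theta\|_1 < \frac{\delta}{K\psi}\}\, \pi(d\theta)} \begin{cases} = d\log\left( \frac{K\psi(D+1)}{\delta} \right) & \text{when } \delta/(K\psi) \le 1, \\ \le d\log\left( K\psi(D+1) \right) & \text{otherwise.} \end{cases}$$

Thus, in any case,

$$\log \frac{1}{\int_{\theta \in \Theta'} \mathbf{1}\{R(\theta) - R(\bar\theta) < \delta\}\, \pi(d\theta)} \le d\log\left( \frac{(K\psi \vee K^2\psi^2)(D+1)}{\delta} \right)$$

and the assumptions of Theorem 5 are satisfied for $d(\Theta, \pi) = d$ and
$D(\Theta, \pi) = (K\psi \vee K^2\psi^2)(D + 1)$.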

In conclusion, for some sets of predictors with a non-classical structure, the
Gibbs estimator might be preferred to the ERM.

5.4. Aggregation in the model-selection setting

Consider now several models of predictors $\Theta_1, \dots, \Theta_M$ and
consider $\Theta = \bigsqcup_{i=1}^{M} \Theta_i$ (disjoint union). Our aim is to
predict as well as the best predictors among all $\Theta_j$'s, but paying only
the price for learning in the $\Theta_j$ that contains the oracle. In order to
get such a result, let us choose $M$ priors $\pi_j$, one on each model, such
that $\pi_j(\Theta_j) = 1$ for all $j \in \{1, \dots, M\}$. Let
$\pi = \sum_{j=1}^{M} p_j \pi_j$ be a mixture of these priors with prior weights
$p_j \ge 0$ satisfying $\sum_{j=1}^{M} p_j = 1$. Denote

$$\bar\theta_j \in \arg\min_{\theta \in \Theta_j} R(\theta)$$

the oracle of the model $\Theta_j$ for any $1 \le j \le M$. For any
$\lambda > 0$, denote $\hat\rho_{\lambda,j}$ the Gibbs distribution on
$\Theta_j$ and
$\hat\theta_{\lambda,j} = \int_{\Theta_j} \theta \, \hat\rho_{\lambda,j}(d\theta)$
the corresponding Gibbs estimator. A Gibbs predictor based on a model selection
procedure satisfies an oracle inequality with slow rate of convergence:

Theorem 7. Assume that:

1. Bound(B) is satisfied for some $B > 0$;

2. LipLoss(K) is satisfied for some $K > 0$;

3. WeakDep(C) is satisfied for some $C > 0$;

4. for any $j \in \{1, \dots, M\}$ we have

(a) Lip($L_j$) is satisfied by the model $\Theta_j$ for some $L_j > 0$,

(b) there are constants $d_j = d(\Theta_j, \pi_j)$ and
$D_j = D(\Theta_j, \pi_j)$ such that

$$\forall \delta > 0, \quad \log \frac{1}{\int_{\theta \in \Theta_j} \mathbf{1}\{R(\theta) - R(\bar\theta_j) < \delta\}\, \pi_j(d\theta)} \le d_j \log\left( \frac{D_j}{\delta} \right).$$

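Denote $\kappa_j = \kappa(K, L_j, B, C) = K(1 + L_j)(B + C)/\sqrt{2}$ and define
$\hat\theta = \hat\theta_{\lambda_{\hat j}, \hat j}$ where $\hat j$ minimizes
the function of $j$

$$\int_{\Theta_j} r_n(\theta)\, \hat\rho_{\lambda_j, j}(d\theta) + \frac{\lambda_j \kappa_j^2}{n(1 - k/n)^2} + \frac{\mathrm{KL}(\hat\rho_{\lambda_j, j}, \pi_j) + \log(2/(\varepsilon p_j))}{\lambda_j}$$

with

$$\lambda_j = \arg\min_{\lambda > 0} \left[ \frac{2\lambda\kappa_j^2}{n(1 - k/n)^2} + 2\, \frac{d_j \log(D_j e \lambda/d_j) + \log(2/(\varepsilon p_j))}{\lambda} \right].$$

Then, with probability at least $1 - \varepsilon$, the following oracle
inequality holds

$$R(\hat\theta) \le \inf_{1 \le j \le M} \dots$$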
The proof is given in Appendix B. A similar result can be obtained if we
replace the Gibbs predictor in each model by the ERM predictor in each model.
The resulting procedure is known in the iid case under the name SRM (Structural
Risk Minimization), see [Vap99], or penalized risk minimization, [BM01].
However, as was already the case for a fixed model, additional assumptions are
required to deal with ERM predictors. In the model-selection context, the
procedure to choose among all the ERM predictors also depends on the unknown
$\kappa_j$'s. Thus the model-selection procedure based on Gibbs predictors
outperforms the one based on the ERM predictors.

6. Fast rates oracle inequalities 
6.1. Discussion on the assumptions 

In this section, we study conditions under which the rate 1/n can be achieved. 
These conditions are restrictive: 

• now $p = 1$, i.e. the process $(X_t)_{t \in \mathbb{Z}}$ is real-valued;

• the dependence condition WeakDep(C) is replaced by PhiMix(C);

• we assume additionally Margin($\mathcal{K}$) for some $\mathcal{K} > 0$.

Let us provide some examples of processes satisfying the uniform mixing
assumption PhiMix(C). In the three following examples $(\epsilon_t)$ denotes an
iid sequence (called the innovations).

Example 7 (AR($p$) process). Consider the stationary solution $(X_t)$ of an
AR($p$) model: $\forall t \in \mathbb{Z}$,
$X_t = \sum_{j=1}^{p} a_j X_{t-j} + \epsilon_t$. Assume that $(\epsilon_t)$ is
bounded with a distribution possessing an absolutely continuous component. If
$A(z) = 1 - \sum_{j=1}^{p} a_j z^j$ has no root inside the unit disk in
$\mathbb{C}$ then $(X_t)$ is a geometrically $\phi$-mixing process, see [AP86],
and PhiMix(C) is satisfied for some $C$.

Example 8 (MA($p$) process). Consider the stationary process $(X_t)$ such that
$X_t = \sum_{j=1}^{p} b_j \epsilon_{t-j}$ for all $t \in \mathbb{Z}$. By
definition, the process $(X_t)$ is stationary and $\phi$-dependent; it is even
$p$-dependent, in the sense that $\phi_r = 0$ for $r > p$. Thus PhiMix(C) is
satisfied for some $C > 0$.

Example 9 (Non linear processes). For extensions of the AR($p$) model of the
form $X_t = F(X_{t-1}, \dots, X_{t-p}; \epsilon_t)$, $\phi$-mixing coefficients
can also be computed and satisfy PhiMix(C). See e.g. [MT93].

We now provide an example of a predictive model satisfying all the assumptions
required to obtain fast rates oracle inequalities, in particular
Margin($\mathcal{K}$), when the loss function $\ell$ is quadratic, i.e.
$\ell(x, x') = (x - x')^2$:

Example 10. Consider Example 2 where

$$f_\theta(X_{t-1}, \dots, X_{t-k}) = \sum_{i=1}^{N} \theta_i \varphi_i(X_{t-1}, \dots, X_{t-k}),$$

for functions $(\varphi_i)_{i=1}^{N}$ of $(\mathbb{R}^p)^k$ to $\mathbb{R}^p$,
and $\theta = (\theta_1, \dots, \theta_N) \in \mathbb{R}^N$. Assume the
$\varphi_i$ upper bounded by 1 and
$\Theta = \{\theta \in \mathbb{R}^N, \|\theta\|_1 \le L\}$ such that Lip(L)
holds. Moreover LipLoss(K) is satisfied with $K = 2B$. Assume that
$\bar\theta = \arg\min_{\theta \in \mathbb{R}^N} R(\theta) \in \Theta$ in order
to have:

$$\begin{aligned}
& \mathbb{E}\left\{ \left[ \left( X_{q+1} - f_\theta(X_q, \dots, X_1) \right)^2 - \left( X_{q+1} - f_{\bar\theta}(X_q, \dots, X_1) \right)^2 \right]^2 \right\} \\
&\quad = \mathbb{E}\left\{ \left[ f_\theta(X_q, \dots, X_1) - f_{\bar\theta}(X_q, \dots, X_1) \right]^2 \left[ 2X_{q+1} - f_\theta(X_q, \dots, X_1) - f_{\bar\theta}(X_q, \dots, X_1) \right]^2 \right\} \\
&\quad \le \mathbb{E}\left\{ \left[ f_\theta(X_q, \dots, X_1) - f_{\bar\theta}(X_q, \dots, X_1) \right]^2 4B^2(1 + L)^2 \right\} \\
&\quad \le 4B^2(1 + L)^2 \left[ R(\theta) - R(\bar\theta) \right] \quad \text{by Pythagorean theorem.}
\end{aligned}$$

Assumption Margin($\mathcal{K}$) is satisfied with
$\mathcal{K} = 4B^2(1 + L)^2$. According to Theorem 8 below, the oracle
inequality with fast rates holds as soon as Assumption PhiMix(C) is satisfied.
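
The Pythagorean step used in Example 10 has an exact empirical counterpart,
which the following sketch checks on synthetic data. It is an illustration
under assumptions of ours: an iid design bounded by 1 in place of the
$\varphi_i(X_{t-1}, \dots, X_{t-k})$, a least-squares minimizer over the whole
of $\mathbb{R}^N$ in place of $\bar\theta$, and empirical means in place of
expectations. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 500, 3
Phi = rng.uniform(-1.0, 1.0, size=(n, N))            # dictionary values, bounded by 1
y = Phi @ np.array([0.5, -0.3, 0.2]) + rng.uniform(-0.5, 0.5, size=n)

def risk(th):
    """Empirical quadratic risk of the linear predictor Phi @ th."""
    return np.mean((y - Phi @ th) ** 2)

# Unconstrained least-squares minimizer, the empirical analogue of theta-bar.
theta_bar, *_ = np.linalg.lstsq(Phi, y, rcond=None)
theta = theta_bar + np.array([0.1, -0.2, 0.05])      # any competing parameter

gap = Phi @ theta - Phi @ theta_bar                   # f_theta - f_theta_bar
excess = risk(theta) - risk(theta_bar)                # equals mean(gap**2) by Pythagoras

# Left-hand side of the margin condition and the bound obtained by factoring
# a^2 - b^2 = (a - b)(a + b) and bounding the second factor by its maximum.
lhs = np.mean(((y - Phi @ theta) ** 2 - (y - Phi @ theta_bar) ** 2) ** 2)
bound = np.max((2 * y - Phi @ theta - Phi @ theta_bar) ** 2) * excess
```

The identity `excess == mean(gap**2)` relies on the residual of the
least-squares fit being orthogonal to the columns of `Phi`, which is why the
minimizer is taken over the whole space.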

6.2. General result 

We only give oracle inequalities for the Gibbs predictor in the model-selection
setting. In the case of one single model, this result can be extended to the
ERM predictor. For several models, the approach based on the ERM predictors
requires a penalized risk minimization procedure as in the slow rates case. In
the fast rates case, the Gibbs predictor itself directly has nice properties.
Let $\Theta = \bigsqcup_{i=1}^{M} \Theta_i$ (disjoint union), choose
$\pi = \sum_{j=1}^{M} p_j \pi_j$ and denote
$\bar\theta_j \in \arg\min_{\theta \in \Theta_j} R(\theta)$ as previously.
Theorem 8. Assume that:

1. Margin($\mathcal{K}$) and LipLoss(K) are satisfied for some
$K, \mathcal{K} > 0$;

2. Bound(B) is satisfied for some $B > 0$;

3. PhiMix(C) is satisfied for some $C > 0$;

4. Lip(L) is satisfied for some $L > 0$;

5. for any $j \in \{1, \dots, M\}$, there exist $d_j = d(\Theta_j, \pi_j)$ and
$D_j = D(\Theta_j, \pi_j)$ satisfying the relation

$$\forall \delta > 0, \quad \log \frac{1}{\int_{\theta \in \Theta_j} \mathbf{1}\{R(\theta) - R(\bar\theta_j) < \delta\}\, \pi_j(d\theta)} \le d_j \log\left( \frac{D_j}{\delta} \right).$$

Then for

$$\lambda = \frac{n - k}{4kKLBC} \wedge \frac{n - k}{16kC}$$

the oracle inequality (3.2) holds for any $\varepsilon > 0$ with

$$\Delta(n, \lambda, \pi, \varepsilon) = 4 \inf_{j} \left\{ R(\bar\theta_j) - R(\bar\theta) + 4kC(4 \vee KLB)\, \frac{d_j \log\left( \frac{D_j e (n - k)}{16kCd_j} \right) + \log\left( \frac{2}{\varepsilon p_j} \right)}{n - k} \right\}.$$
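
The tuning prescribed by Theorem 8 can be evaluated numerically. The sketch
below assumes that the theorem sets
$\lambda = (n-k)/(4kKLBC) \wedge (n-k)/(16kC)$ and that the complexity term of
model $j$ is $4kC(4 \vee KLB)\,[d_j \log(D_j e(n-k)/(16kCd_j)) +
\log(2/(\varepsilon p_j))]/(n-k)$. The constants, which are rarely known in
practice, and the function names are hypothetical.

```python
import math

def lambda_fast(n, k, K, L, B, C):
    """Inverse temperature of Theorem 8: minimum of the two admissible values."""
    return min((n - k) / (4 * k * K * L * B * C), (n - k) / (16 * k * C))

def fast_term(n, k, K, L, B, C, d, D, p_j, eps):
    """Complexity term of one model in the fast-rate remainder (without R(theta_j) - R(theta))."""
    scale = 4 * k * C * max(4.0, K * L * B)           # equals (n - k) / lambda
    complexity = (d * math.log(D * math.e * (n - k) / (16 * k * C * d))
                  + math.log(2 / (eps * p_j)))
    return scale * complexity / (n - k)

# Hypothetical constants, for illustration only.
consts = dict(k=1, K=2.0, L=1.0, B=1.0, C=2.0)
lam = lambda_fast(1001, **consts)
small = fast_term(1001, d=3, D=1.0, p_j=0.5, eps=0.05, **consts)
large = fast_term(100001, d=3, D=1.0, p_j=0.5, eps=0.05, **consts)
```

Multiplying the sample size by 100 divides the term by roughly 50 here, in line
with a rate of order $d \log(n)/n$ rather than $\sqrt{d/n}$.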




Compared with the slow rates case, we do not have to optimize with respect to
$\lambda$ as the optimal order for $\lambda$ is independent of $j$. In practice,
the value of $\lambda$ provided by Theorem 8 is too conservative. In the iid
case, it is shown in [DT08] that the value $\lambda = n/(4\sigma^2)$, where
$\sigma^2$ is the variance of the noise of the regression, yields good results.
In our simulation results, we will use $\lambda = n/\widehat{\mathrm{var}}(X)$,
where $\widehat{\mathrm{var}}(X)$ is the empirical variance of the observed time
series.

Notice that for the index $j_0$ such that $R(\bar\theta_{j_0}) = R(\bar\theta)$
we obtain:

$$R(\hat\theta_\lambda) \le R(\bar\theta) + 4 \cdot 4kC(4 \vee KLB)\, \frac{d_{j_0} \log\left( D_{j_0} e (n - k)/(16kCd_{j_0}) \right) + \log\left( 2/(\varepsilon p_{j_0}) \right)}{n - k}.$$

So, the oracle inequality achieves the fast rate
$(d_{j_0}/n)\log(n/d_{j_0})$ where $j_0$ is the model of the oracle. However,
note that the choice $j = j_0$ does not necessarily reach the infimum in
Theorem 8.

Let us compare the rates in Theorem 8 to the ones in [Mei00, MM98, AD11,
DAJJ12]. In [Mei00, MM98], the optimal rate $1/n$ is never obtained. The paper
[AD11] proves fast rates for online algorithms that are also computationally
efficient, see also [DAJJ12]. The fast rate $1/n$ is reached when the
coefficients $(\phi_r)$ are geometrically decreasing. In other cases, the rate
is slower. Note that we do not suffer such a restriction. The Gibbs estimator of
Theorem 8 can also be computed efficiently thanks to MCMC procedures, see
[AL11, DT08].

6.3. Corollary: sparse autoregression

Let the predictors be the linear autoregressive predictors

$$\hat X_t^\theta = \sum_{j=1}^{p} \theta_j X_{t-j}.$$

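For any $J \subset \{1, \dots, p\}$, define the model:

$$\Theta_J = \{\theta \in \mathbb{R}^p : \|\theta\|_1 \le L \text{ and } \theta_j \ne 0 \Leftrightarrow j \in J\}.$$

Let us remark that we have the disjoint union
$\Theta = \bigsqcup_{J \subset \{1, \dots, p\}} \Theta_J = \{\theta \in \mathbb{R}^p : \|\theta\|_1 \le L\}$.
We choose $\pi_J$ as the uniform probability measure on $\Theta_J$ and [...]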
Corollary 1. Assume that
$\bar\theta = \arg\min_{\theta \in \mathbb{R}^p} R(\theta) \in \Theta$ and
PhiMix(C) is satisfied for some $C > 0$ as well as Bound(B). Then the oracle
inequality (3.2) is satisfied for any $\varepsilon > 0$ with

$$\Delta(n, \lambda, \pi, \varepsilon) = 4 \inf_{J} \left\{ R(\bar\theta_J) - R(\bar\theta) + \mathrm{cst} \cdot \frac{|J| \log\left( (n - k)p/|J| \right) + \log\left( \frac{2}{\varepsilon} \right)}{n - k} \right\}$$

for some constant $\mathrm{cst} = \mathrm{cst}(B, C, L)$.

This extends the results of [AL11, DT08, Ger11] to the case of autoregression.

Proof. The proof follows the computations of Example 10 that we do not
reproduce here: we check the conditions LipLoss(K) with $K = 2B$, Lip(L) and
Margin($\mathcal{K}$) with $\mathcal{K} = 4B^2(1 + L)^2$. We can apply Theorem 8
with $d_J = |J|$ and $D_J = L$. □
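
The models $\Theta_J$ of this subsection are indexed by the supports
$J \subset \{1, \dots, p\}$, so there are $2^p$ of them and every coefficient
vector belongs to exactly one. The short sketch below, with hypothetical
function names and an arbitrary example vector, enumerates the supports and
maps a parameter to its model.

```python
from itertools import combinations

def all_supports(p):
    """Yield every subset J of {1, ..., p} as a sorted tuple (empty set included)."""
    for size in range(p + 1):
        yield from combinations(range(1, p + 1), size)

def support_of(theta):
    """Return the support J = {j : theta_j != 0} of a coefficient vector."""
    return tuple(j for j, th in enumerate(theta, start=1) if th != 0)

p = 4
models = list(all_supports(p))          # the 2^p candidate sparsity patterns
theta = [0.0, 0.3, 0.0, -0.2]           # example parameter with two active lags
J = support_of(theta)                   # index of the unique model containing theta
```

The exponential number of models is what makes exhaustive selection costly and
motivates sampling-based computation of the Gibbs estimator.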



7. Application to French GDP forecasting

7.1. Uncertainty in GDP forecasting

Every quarter $t \ge 1$, the French national bureau of statistics, INSEE¹, publishes the growth rate of the French GDP (Gross Domestic Product). Since it involves a huge amount of data that take months to be collected and processed, the computation of the GDP growth rate $\log(\mathrm{GDP}_t/\mathrm{GDP}_{t-1})$ takes a long time (two years). This means that at time $t$, the value $\log(\mathrm{GDP}_t/\mathrm{GDP}_{t-1})$ is actually not known. However, a preliminary value of the growth rate is published only 45 days after the end of the current quarter $t$. This value is called a flash estimate and is the quantity that INSEE forecasters actually try to predict, at least initially. As we want to work under the same constraint as the INSEE, we will now focus on the prediction of the flash estimate and let $\Delta\mathrm{GDP}_t$ denote this quantity. To forecast at time $t$, we will use:

1. the past flash estimates² $\Delta\mathrm{GDP}_j$, $0 < j < t$;

2. past climate indicators $I_j$, $0 < j < t$, based on business surveys.

Business surveys are questionnaires of about ten questions sent monthly to a representative panel of French companies (see [Dev84] for more details). As a consequence, these surveys provide information from the economic decision makers. Moreover, they are available at the end of each month and thus can be used to forecast the French GDP. INSEE publishes a composite indicator, the French business climate indicator, that summarizes the information of the whole business survey, see [CM09, DM06]. Following [Cor10], let $I_t$ be the mean of the last three (monthly based) climate indicators available for each quarter $t > 0$ at the date of publication of $\Delta\mathrm{GDP}_t$. All these values (GDP, climate indicator) are available from the INSEE website. Note that a similar approach is used in other countries, see e.g. [BBR08] on forecasting the European Union GDP growth thanks to EUROSTATS data.

¹ Institut National de la Statistique et des Etudes Economiques, http://www.insee.fr/

² It has been checked that replacing past flash estimates by the actual GDP growth rate when it becomes available does not improve the quality of the forecasting [Min10].

In order to quantify the uncertainty of the forecasting, associated confidence intervals are usually provided. The ASA and the NBER started using density forecasts in 1968, while the Central Bank of England and INSEE provide their predictions with a fan chart, see e.g. [DTW97, TW00] for surveys on density forecasting and [BFW98] for fan charts. However, the statistical methodology used is often crude and, until 2012, the fan charts provided by the INSEE were based on the homoscedasticity of the Gaussian forecasting errors, see [Cor10, Dow04]. However, empirical evidence shows that:

1. the GDP forecasting is more uncertain in a period of crisis or recession;

2. the forecasting errors are not symmetrically distributed.



7.2. Application of Theorem 6 for the GDP forecasting 

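Define $X_t$ as the data observed at time $t$: $X_t = (\Delta\mathrm{GDP}_t, I_t)' \in \mathbb{R}^2$. We use the quantile loss function (see Example 4 page 6) for some $0 < \tau < 1$ of the quantity of interest $\Delta\mathrm{GDP}_t$:

\[ \ell_\tau\big((\Delta\mathrm{GDP}_t, I_t), (\Delta'\mathrm{GDP}_t, I'_t)\big) = \begin{cases} \tau\,(\Delta\mathrm{GDP}_t - \Delta'\mathrm{GDP}_t), & \text{if } \Delta\mathrm{GDP}_t - \Delta'\mathrm{GDP}_t > 0, \\ -(1-\tau)\,(\Delta\mathrm{GDP}_t - \Delta'\mathrm{GDP}_t), & \text{otherwise.} \end{cases} \]

We use the family of forecasters proposed by [Cor10] given by the relation

\[ f_\theta(X_{t-1}, X_{t-2}) = \theta_0 + \theta_1 \Delta\mathrm{GDP}_{t-1} + \theta_2 I_{t-1} + \theta_3 (I_{t-1} - I_{t-2})\,|I_{t-1} - I_{t-2}| \tag{7.1} \]

where $\theta = (\theta_0, \theta_1, \theta_2, \theta_3) \in \Theta(B)$. Fix $D > 0$ and

\[ \Theta = \Big\{ \theta = (\theta_0, \theta_1, \theta_2, \theta_3) \in \mathbb{R}^4,\ \|\theta\|_1 = \sum_{i=0}^{3} |\theta_i| \le D \Big\}. \]

Let us denote $R^\tau(\theta) := \mathbb{E}[\ell_\tau(\Delta\mathrm{GDP}_t, f_\theta(X_{t-1}, X_{t-2}))]$ the risk of the forecaster $f_\theta$ and let $r_n^\tau$ denote the associated empirical risk. We let $\hat\theta^{ERM,\tau}$ denote the ERM with quantile loss $\ell_\tau$:

\[ \hat\theta^{ERM,\tau} \in \arg\min_{\theta \in \Theta} r_n^\tau(\theta). \]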
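As an illustration, the quantile loss, the forecaster (7.1) and the associated empirical risk can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the function names are hypothetical, the empirical risk is assumed to be the plain average of the one-step losses, and the constraint $\|\theta\|_1 \le D$ is not enforced.

```python
def quantile_loss(actual, predicted, tau):
    """Quantile loss at level 0 < tau < 1 between an observed and a predicted value."""
    diff = actual - predicted
    return tau * diff if diff > 0 else -(1.0 - tau) * diff


def forecaster(theta, gdp_prev, climate_prev, climate_prev2):
    """Forecast of relation (7.1) for theta = (theta_0, theta_1, theta_2, theta_3)."""
    delta = climate_prev - climate_prev2
    return (theta[0] + theta[1] * gdp_prev + theta[2] * climate_prev
            + theta[3] * delta * abs(delta))


def empirical_risk(theta, gdp, climate, tau):
    """Average quantile loss of the one-step forecasts (assumed form of the empirical risk)."""
    losses = [
        quantile_loss(gdp[t], forecaster(theta, gdp[t - 1], climate[t - 1], climate[t - 2]), tau)
        for t in range(2, len(gdp))
    ]
    return sum(losses) / len(losses)
```

An ERM would then be any minimizer of `empirical_risk` over the parameter set; the optimizer itself is left out of this sketch.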



We apply Theorem 6 as Lip(L) is satisfied on $\Theta'$ with $L = D+1$ and LipLoss(K) with $K = 1$. If the observations are bounded, stationary such that WeakDep(C) holds for some $C > 0$, the assumptions of Theorem 6 are satisfied with $\psi = B$ and $d = 4$:

Corollary 2. Let us fix $\tau \in (0, 1)$. If the observations are bounded, stationary such that WeakDep(C) holds for some $C > 0$ then for any $\varepsilon > 0$ and $n$ large enough, we have

f ,r t nER.M,r, / s . e Df/m , W3 ( 2e 2 B(D + 



fl T (fl MM ' T ) < inf R T {9) + lo Q 



> 1 



In practice the choice of $D$ has little importance as soon as $D$ is large enough (only the theoretical bound is influenced). As a consequence we take $D = 100$ in our experiments.



7.3. Results 



The results are shown in Figure 1 for forecasting corresponding to $\tau = 0.5$. Figure 2 represents the confidence intervals of order 50%, i.e. $\tau = 0.25$ and $\tau = 0.75$ (left) and the confidence intervals of order 90%, i.e. $\tau = 0.05$ and $\tau = 0.95$ (right). We report only the results for the period 2000-Q1 to 2011-Q3 (using the period 1988-Q1 to 1999-Q4 for learning).



[Figure 1: plot titled "Out-of-sample forecasts"; horizontal axis from 2000 to 2010.]

Fig 1. French GDP forecasting using the quantile loss function with $\tau = 0.5$.

We denote $\hat\theta^{ERM,\tau}[t]$ the estimator computed at time $t-1$, based on the observations $X_j$, $j < t$. We report the online performance:

\[ \text{mean abs. pred. error} = \frac{1}{n} \sum_{t=1}^{n} \Big| \Delta\mathrm{GDP}_t - f_{\hat\theta^{ERM,0.5}[t]}(X_{t-1}, X_{t-2}) \Big| \]

\[ \text{mean quad. pred. error} = \frac{1}{n} \sum_{t=1}^{n} \Big[ \Delta\mathrm{GDP}_t - f_{\hat\theta^{ERM,0.5}[t]}(X_{t-1}, X_{t-2}) \Big]^2 \]

and compare it to the INSEE performance, see Table 1. We also report the frequency with which the GDP falls below the predicted $\tau$-quantile for each $\tau$, see Table 2. Note that this quantity should be close to $\tau$.

[Figure 2: two plots titled "Out-of-sample forecasts"; horizontal axes from 2000 to 2010.]

Fig 2. French GDP online 50%-confidence intervals (left) and 90%-confidence intervals (right).



Predictor              | Mean absolute prediction error | Mean quadratic prediction error
$\hat\theta^{ERM,0.5}$ | 0.2249                         | 0.0812
INSEE                  | 0.2579                         | 0.0967

Table 1. Performances of the ERM and of the INSEE.



$\tau$ | Estimator               | Frequency
0.05   | $\hat\theta^{ERM,0.05}$ | 0.1739
0.25   | $\hat\theta^{ERM,0.25}$ | 0.4130
0.5    | $\hat\theta^{ERM,0.5}$  | 0.6304
0.75   | $\hat\theta^{ERM,0.75}$ | 0.9130
0.95   | $\hat\theta^{ERM,0.95}$ | 0.9782

Table 2. Empirical frequencies of the event: GDP falls under the predicted $\tau$-quantile.
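The quantities reported in Tables 1 and 2 are simple averages over the out-of-sample period. A minimal Python sketch of how they could be computed is given below; the function names are hypothetical and the inputs are plain lists of observed and predicted values, not the INSEE data.

```python
def mean_abs_error(actual, predicted):
    """Mean absolute prediction error over the forecasting period."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_quad_error(actual, predicted):
    """Mean quadratic prediction error over the forecasting period."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def frequency_below(actual, predicted_quantiles):
    """Fraction of dates at which the observed value falls under the predicted quantile."""
    hits = sum(1 for a, q in zip(actual, predicted_quantiles) if a <= q)
    return hits / len(actual)
```

For a well-calibrated forecast of the $\tau$-quantile, `frequency_below` should be close to $\tau$.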



The methodology fails to forecast the importance of the 2008 subprime crisis, as was the case for the INSEE forecaster, see [Cor10]. However, it is interesting to note that the confidence interval is larger at that date: the forecast is less reliable, but thanks to our adaptive confidence interval, it would have been possible to know at that time that the prediction was not reliable. Another interesting point is that the lower bounds of the confidence intervals vary over time while the upper bound is almost constant for $\tau = 0.95$. It supports the idea of asymmetric forecasting errors. A parametric model with Gaussian innovations would lead to underestimating the risk of recession.



8. Simulation study 

In this section, we finally compare the ERM or Gibbs estimators to the Quasi Maximum Likelihood Estimator (QMLE) based method used by the R function ARMA [R D08]. The idea is not to claim any superiority of one method over another; it is rather to check that the ERM and Gibbs estimators can be safely used in various contexts as their performances are close to the standard QMLE even in the context where the series is generated from an ARMA model. It is also the opportunity to check the robustness of our estimators in case of misspecification.

n    | Model | Innovations | ERM abs.        | ERM quad.       | QMLE
100  | (8.1) | Gaussian    | 0.1436 (0.1419) | 0.1445 (0.1365) | 0.1469 (0.1387)
     |       | Uniform     | 0.1594 (0.1512) | 0.1591 (0.1436) | 0.1628 (0.1486)
     | (8.2) | Gaussian    | 0.1770 (0.1733) | 0.1699 (0.1611) | 0.1728 (0.1634)
     |       | Uniform     | 0.1520 (0.1572) | 0.1528 (0.1495) | 0.1565 (0.1537)
1000 | (8.1) | Gaussian    | 0.1336 (0.1291) | 0.1343 (0.1294) | 0.1345 (0.1296)
     |       | Uniform     | 0.1718 (0.1369) | 0.1729 (0.1370) | 0.1732 (0.1372)
     | (8.2) | Gaussian    | 0.1612 (0.1375) | 0.1610 (0.1367) | 0.1613 (0.1369)
     |       | Uniform     | 0.1696 (0.1418) | 0.1687 (0.1404) | 0.1691 (0.1407)

Table 3. Performances of the ERM estimators and ARMA, on the simulations. The column "ERM abs." is for the ERM estimator with absolute loss, the column "ERM quad." for the ERM with quadratic loss. The standard deviations are given in parentheses.

8.1. Parametric family of predictors

Here, we compare the ERM to the QMLE.

We draw simulations from an AR(1) model (8.1) and a non-linear model (8.2):

\[ X_t = 0.5 X_{t-1} + \varepsilon_t \tag{8.1} \]

\[ X_t = 0.5 \sin(X_{t-1}) + \varepsilon_t \tag{8.2} \]

where $\varepsilon_t$ are iid innovations. We consider two cases of distributions for $\varepsilon_t$: the uniform case, $\varepsilon_t \sim \mathcal{U}[-a, a]$, and the Gaussian case, $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$. Note that, in the first case, both models satisfy the assumptions of Theorem 8: there exists a stationary solution $(X_t)$ that is mixing when the innovations are uniformly distributed and WeakDep(C) is satisfied for some $C > 0$. This paper does not provide any theoretical results for the Gaussian case as it is unbounded. However, we refer the reader to [AW12] for truncation techniques that allow one to deal with this case too. We fix $\sigma = 0.4$ and $a = 0.70$ such that $\mathrm{Var}(\varepsilon_t) \simeq 0.16$ in both cases. For each model, we first simulate a sequence of length $n$ and then we predict $X_n$ using the observations $(X_1, \dots, X_{n-1})$. Each simulation is repeated 100 times and we report the mean quadratic prediction errors in Table 3.
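The simulation design can be sketched as follows in Python. This is a hedged illustration only: the one-parameter predictor family $f_\theta(x) = \theta x$, the grid search used for the absolute loss, the burn-in length and all function names are assumptions made here, not the procedure used to produce Table 3.

```python
import math
import random


def simulate(n, model, innovations, rng, burn_in=200):
    """Simulate n observations from model "8.1" or "8.2" with uniform or Gaussian noise."""
    a, sigma = 0.70, 0.4  # values fixed in the text, Var(eps_t) close to 0.16
    x, path = 0.0, []
    for _ in range(n + burn_in):
        eps = rng.uniform(-a, a) if innovations == "uniform" else rng.gauss(0.0, sigma)
        x = (0.5 * x if model == "8.1" else 0.5 * math.sin(x)) + eps
        path.append(x)
    return path[burn_in:]  # drop the burn-in so the path is close to stationarity


def erm_quadratic(path):
    """ERM with quadratic loss over f_theta(x) = theta * x (closed-form least squares)."""
    num = sum(path[t] * path[t - 1] for t in range(1, len(path)))
    den = sum(path[t - 1] ** 2 for t in range(1, len(path)))
    return num / den


def erm_absolute(path, grid_size=201):
    """ERM with absolute loss over the same family, by grid search on [-1, 1]."""
    grid = [-1.0 + 2.0 * i / (grid_size - 1) for i in range(grid_size)]

    def risk(theta):
        return sum(abs(path[t] - theta * path[t - 1]) for t in range(1, len(path)))

    return min(grid, key=risk)
```

With an estimate `theta_hat` computed on $(X_1, \dots, X_{n-1})$, the forecast of $X_n$ in this sketch is simply `theta_hat * path[-1]`.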

It is interesting to note that the ERM estimator with absolute loss performs better on model (8.1) while the ERM with quadratic loss performs slightly better on model (8.2). The differences tend to be too small to be significant; however, the numerical results tend to indicate that both methods are robust to model misspecification. Also, both estimators seem to perform better than the R QMLE procedure when $n = 100$, but the differences tend to be less perceptible when $n$ grows.

n    | Model | Innovations | Gibbs         | AIC           | Full Model
100  | (8.3) | Uniform     | 0.165 (0.022) | 0.165 (0.023) | 0.182 (0.029)
     |       | Gaussian    | 0.167 (0.023) | 0.161 (0.023) | 0.173 (0.027)
     | (8.4) | Uniform     | 0.163 (0.020) | 0.169 (0.022) | 0.178 (0.022)
     |       | Gaussian    | 0.172 (0.033) | 0.179 (0.040) | 0.201 (0.049)
     | (8.5) | Uniform     | 0.174 (0.022) | 0.179 (0.028) | 0.201 (0.040)
     |       | Gaussian    | 0.179 (0.025) | 0.182 (0.025) | 0.202 (0.031)
1000 | (8.3) | Uniform     | 0.163 (0.005) | 0.163 (0.005) | 0.166 (0.005)
     |       | Gaussian    | 0.160 (0.005) | 0.160 (0.005) | 0.162 (0.005)
     | (8.4) | Uniform     | 0.164 (0.004) | 0.166 (0.004) | 0.167 (0.004)
     |       | Gaussian    | 0.160 (0.008) | 0.161 (0.008) | 0.163 (0.008)
     | (8.5) | Uniform     | 0.171 (0.005) | 0.172 (0.006) | 0.175 (0.006)
     |       | Gaussian    | 0.173 (0.009) | 0.173 (0.009) | 0.176 (0.010)

Table 4. Performances of the Gibbs, AIC and "full model" predictors on simulations.


8.2. Sparse autoregression

To illustrate Corollary 1, we compare the Gibbs predictor to the model selection approach of the ARMA procedure in the R software. This procedure computes the QMLE estimator in each AR(p) model, $1 \le p \le q$, and then selects the order $p$ by Akaike's AIC criterion [Aka73]. The Gibbs estimator is computed using a Reversible Jump MCMC algorithm as in [AL11]. The parameter $\lambda$ is taken as $\lambda = n/\widehat{\mathrm{var}}(X)$, where $\widehat{\mathrm{var}}(X)$ is the empirical variance of the observed time series.

We draw the data according to the following models:

\[ X_t = 0.5 X_{t-1} + 0.1 X_{t-2} + \varepsilon_t \tag{8.3} \]

\[ X_t = 0.6 X_{t-4} + 0.1 X_{t-8} + \varepsilon_t \tag{8.4} \]

\[ X_t = \cos(X_{t-1}) \sin(X_{t-2}) + \varepsilon_t \tag{8.5} \]

where $\varepsilon_t$ are iid innovations. We still consider the uniform ($\varepsilon_t \sim \mathcal{U}[-a, a]$) and the Gaussian ($\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$) cases with $\sigma = 0.4$ and $a = 0.70$. We compare the Gibbs predictor performances to those of the estimator based on the AIC criterion and to the QMLE in the AR(q) model, the so-called "full model". For each model, we first simulate a time series of length $2n$, use the observations 1 to $n$ as a learning set and $n+1$ to $2n$ as a test set, for $n = 100$ and $n = 1000$. Each simulation is repeated 20 times and we report in Table 4 the mean and the standard deviation of the empirical quadratic errors for each method and each model.
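The learning/test protocol described above can be sketched as follows in Python for model (8.4) with uniform innovations. This is a minimal illustration with hypothetical names and an assumed burn-in; the predictor is any function of the past, and the "oracle" predictor below (the true coefficients of (8.4)) is chosen here only for illustration and is not one of the methods compared in Table 4.

```python
import random


def simulate_model_84(length, rng, a=0.70, burn_in=200):
    """Simulate X_t = 0.6 X_{t-4} + 0.1 X_{t-8} + eps_t with eps_t uniform on [-a, a]."""
    x = [0.0] * 8  # arbitrary initial values, forgotten after the burn-in
    for _ in range(length + burn_in):
        x.append(0.6 * x[-4] + 0.1 * x[-8] + rng.uniform(-a, a))
    return x[-length:]


def quadratic_test_error(series, n, predictor, order=8):
    """Mean quadratic one-step error on the test set (0-based indices n, ..., 2n - 1)."""
    errors = [
        (series[t] - predictor(series[t - order:t])) ** 2  # predictor sees the last `order` values
        for t in range(n, 2 * n)
    ]
    return sum(errors) / len(errors)


def oracle_predictor(past):
    """Predictor using the true coefficients of (8.4); past[-j] is X_{t-j}."""
    return 0.6 * past[-4] + 0.1 * past[-8]
```

Any estimator fitted on the first $n$ observations can be plugged in as `predictor`; repeating the simulation with different seeds gives the mean and standard deviation of the test errors.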

Note that the Gibbs predictor performs better on Models (8.4) and (8.5) while the AIC predictor performs slightly better on Model (8.3). The difference tends to be negligible when $n$ grows; this is consistent with the fact that we develop here a non-asymptotic theory. Note that the Gibbs predictor also performs well in the case of a Gaussian noise where the boundedness assumption is not satisfied.

Appendix A: A general PAC-Bayesian inequality

Theorems 1 and 5 are actually both corollaries of a more general result that we would like to state for the sake of completeness. This result is the analogue of the PAC-Bayesian bounds proved by Catoni in the case of iid data [Cat07].

Theorem 9 (PAC-Bayesian Oracle Inequality for the Gibbs estimator). Let us assume that LowRates($\kappa$) is satisfied for some $\kappa > 0$. Then, for any $\lambda, \varepsilon > 0$ we have

\[ \mathbb{P}\left\{ R(\hat\theta_\lambda) \le \inf_{\rho \in \mathcal{M}_+^1(\Theta)} \left[ \int R\,d\rho + \frac{2\lambda\kappa^2}{n(1-k/n)^2} + \frac{2\mathcal{K}(\rho,\pi) + 2\log(2/\varepsilon)}{\lambda} \right] \right\} \ge 1 - \varepsilon. \]



This result is proved in Appendix B, but we can now provide the proofs of 
Theorems 1 and 5. 

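Proof of Theorem 1. We apply Theorem 9 for $\pi = \frac{1}{M}\sum_{\theta\in\Theta}\delta_\theta$ and restrict the inf in the upper bound to Dirac masses $\rho \in \{\delta_\theta, \theta \in \Theta\}$. We obtain $\mathcal{K}(\rho,\pi) = \log M$, and the upper bound for $R(\hat\theta_\lambda)$ becomes:

\[ R(\hat\theta_\lambda) \le \inf_{\rho\in\{\delta_\theta,\,\theta\in\Theta\}} \left[ \int R\,d\rho + \frac{2\lambda\kappa^2}{n(1-k/n)^2} + \frac{2\log(2M/\varepsilon)}{\lambda} \right] = \inf_{\theta\in\Theta} \left[ R(\theta) + \frac{2\lambda\kappa^2}{n(1-k/n)^2} + \frac{2\log(2M/\varepsilon)}{\lambda} \right]. \]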
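For completeness, the value $\mathcal{K}(\rho,\pi) = \log M$ used in this proof follows directly from the definition of the Kullback divergence, assuming that $\Theta$ is finite with $M$ elements (as the uniform weights $1/M$ suggest) and using the convention $0 \log 0 = 0$:

```latex
\mathcal{K}(\delta_\theta, \pi)
  = \sum_{\theta' \in \Theta} \delta_\theta(\{\theta'\})
      \log \frac{\delta_\theta(\{\theta'\})}{\pi(\{\theta'\})}
  = \log \frac{1}{1/M}
  = \log M ,
\qquad\text{so that}\qquad
\frac{2\mathcal{K}(\delta_\theta,\pi) + 2\log(2/\varepsilon)}{\lambda}
  = \frac{2\log(2M/\varepsilon)}{\lambda}.
```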



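Proof of Theorem 5. An application of Theorem 9 yields that with probability at least $1 - \varepsilon$

\[ R(\hat\theta_\lambda) \le \inf_{\rho\in\mathcal{M}_+^1(\Theta)} \left[ \int R\,d\rho + \frac{2\lambda\kappa^2}{n(1-k/n)^2} + \frac{2\mathcal{K}(\rho,\pi)+2\log(2/\varepsilon)}{\lambda} \right]. \]

Let us estimate the upper bound at the probability distribution $\rho_\delta$ defined as

\[ \frac{d\rho_\delta}{d\pi}(\theta) = \frac{\mathbf{1}\{R(\theta) - R(\bar\theta) < \delta\}}{\int_{t\in\Theta}\mathbf{1}\{R(t)-R(\bar\theta)<\delta\}\,\pi(dt)}. \]

Then we have:

\[ R(\hat\theta_\lambda) \le \inf_{\delta>0} \left[ R(\bar\theta) + \delta + \frac{2\lambda\kappa^2}{n(1-k/n)^2} + 2\,\frac{-\log\int_{t\in\Theta}\mathbf{1}\{R(t) - \inf_\Theta R < \delta\}\,\pi(dt) + \log(2/\varepsilon)}{\lambda} \right]. \]

Under the assumptions of Theorem 5 we have:

\[ R(\hat\theta_\lambda) \le \inf_{\delta>0} \left[ R(\bar\theta) + \delta + \frac{2\lambda\kappa^2}{n(1-k/n)^2} + 2\,\frac{d\log(D/\delta) + \log(2/\varepsilon)}{\lambda} \right]. \]

Taking $\delta = d/\lambda$, we have:

\[ R(\hat\theta_\lambda) \le R(\bar\theta) + \frac{2\lambda\kappa^2}{n(1-k/n)^2} + 2\,\frac{d\log(D\sqrt{e}\,\lambda/d) + \log(2/\varepsilon)}{\lambda}. \]

■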

Appendix B: Proofs

B.1. Preliminaries

We will use Rio's inequality [Rio00] that is an extension of Hoeffding's inequality in a dependent context. For the sake of completeness, we provide here this result when the observations $(X_1, \dots, X_n)$ come from a stationary process $(X_t)$.

Lemma 1 (Rio [Rio00]). Let $h$ be a function $(\mathbb{R}^p)^n \to \mathbb{R}$ such that for all $x_1, \dots, x_n, y_1, \dots, y_n \in \mathbb{R}^p$,

\[ |h(x_1,\dots,x_n) - h(y_1,\dots,y_n)| \le \sum_{i=1}^n \|x_i - y_i\|. \tag{B.1} \]

Then, for any $t > 0$, we have

\[ \mathbb{E}\big(\exp(t\{\mathbb{E}[h(X_1,\dots,X_n)] - h(X_1,\dots,X_n)\})\big) \le \exp\left(\frac{t^2 n\,(B + \theta_{\infty,n}(1))^2}{2}\right). \]

Other exponential inequalities can be used to obtain PAC-Bounds in the context of time series: the inequalities in [Dou94, Sam00] for mixing time series, and [DDL+07, Win10] under weaker "weak dependence" assumptions, [SLCB+12] for martingales. Lemma 1 is very general and yields optimal low rates of convergence. For fast rates of convergence, we will use Samson's inequality that is an extension of Bernstein's inequality in a dependent context.

Lemma 2 (Samson [Sam00]). Let $N \ge 1$, $(Z_i)_{i\in\mathbb{Z}}$ be a stationary process on $\mathbb{R}^k$ and $\phi_r^Z$ denote its $\phi$-mixing coefficients. For any measurable function $f : \mathbb{R}^k \to [-M, M]$, any $0 \le t \le 1/(M K_{\phi^Z}^2)$, we have

\[ \mathbb{E}\big(\exp(t(S_N(f) - \mathbb{E} S_N(f)))\big) \le \exp\big(8 K_{\phi^Z} N \sigma^2(f)\, t^2\big), \]

where $S_N(f) := \sum_{i=1}^N f(Z_i)$, $K_{\phi^Z} = 1 + \sum_{r=1}^N \sqrt{\phi_r^Z}$ and $\sigma^2(f) = \mathrm{Var}(f(Z_i))$.

Proof of Lemma 2. This result can be deduced easily from the proof of Theorem 3 of [Sam00] which states a more general result on empirical processes. In page 457 of [Sam00], replace the definition of $f_N(x_1, \dots, x_n)$ by $f_N(x_1, \dots, x_n) = \sum_{i=1}^n g(x_i)$ (following the notations of [Sam00]). Then check that all the arguments of the proof remain valid; the claim of Lemma 2 is obtained page 460, line 7. ■

We also recall the variational formula of the Kullback divergence.

Lemma 3 (Donsker-Varadhan [DV76] variational formula). For any $\pi \in \mathcal{M}_+^1(E)$, for any measurable upper-bounded function $h : E \to \mathbb{R}$ we have:

\[ \int \exp(h)\,d\pi = \exp\left( \sup_{\rho\in\mathcal{M}_+^1(E)} \left( \int h\,d\rho - \mathcal{K}(\rho,\pi) \right) \right). \tag{B.2} \]

Moreover, the supremum with respect to $\rho$ in the right-hand side is reached for the Gibbs measure $\pi\{h\}$ defined by $\pi\{h\}(dx) = e^{h(x)}\pi(dx)/\pi[\exp(h)]$.

Actually, it seems that in the case of discrete probabilities, this result was already known by Kullback (Problem 8.28 of Chapter 2 in [Kul59]). For a complete proof of this variational formula, even in the non-integrable cases, we refer the reader to [DV76, Cat, Cat07].
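On a finite set, the variational formula (B.2) can be checked numerically. The following Python sketch is only an illustration under that finiteness assumption; the function names are hypothetical, and distributions are represented as lists of weights.

```python
import math


def gibbs_measure(prior, h):
    """Gibbs measure pi{h}: weights proportional to exp(h(x)) * pi(x) on a finite set."""
    weights = [p * math.exp(v) for p, v in zip(prior, h)]
    total = sum(weights)
    return [w / total for w in weights]


def kullback(rho, prior):
    """Kullback divergence K(rho, pi) for discrete distributions (0 log 0 = 0)."""
    return sum(r * math.log(r / p) for r, p in zip(rho, prior) if r > 0)


def variational_objective(rho, prior, h):
    """The quantity (integral of h with respect to rho) minus K(rho, pi) from (B.2)."""
    return sum(r * v for r, v in zip(rho, h)) - kullback(rho, prior)
```

Evaluated at `gibbs_measure(prior, h)`, the objective equals the logarithm of $\int \exp(h)\,d\pi$, and any other distribution gives a smaller or equal value.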



B.2. Technical lemmas for the proofs of Theorems 2, 6, 7 and 9 

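Lemma 4. We assume that LowRates($\kappa$) is satisfied for some $\kappa > 0$. For any $\lambda > 0$ and $\theta \in \Theta$ we have

\[ \mathbb{E}\big(e^{\lambda(R(\theta) - r_n(\theta))}\big) \vee \mathbb{E}\big(e^{\lambda(r_n(\theta) - R(\theta))}\big) \le \exp\left(\frac{\lambda^2\kappa^2}{n(1-k/n)^2}\right). \]

Proof of Lemma 4. Let us fix $\lambda > 0$ and $\theta \in \Theta$. Let us define the function $h$ by:

\[ h(x_1,\dots,x_n) = \frac{1}{K(1+L)} \sum_{i=k+1}^{n} \ell\big(f_\theta(x_{i-1},\dots,x_{i-k}), x_i\big). \]

We now check that $h$ satisfies (B.1); remember that $\ell(x, x') = g(x - x')$ so

\[ |h(x_1,\dots,x_n) - h(y_1,\dots,y_n)| \le \frac{1}{K(1+L)} \sum_{i=k+1}^{n} \big| g(f_\theta(x_{i-1},\dots,x_{i-k}) - x_i) - g(f_\theta(y_{i-1},\dots,y_{i-k}) - y_i) \big| \]

\[ \le \frac{1}{1+L} \sum_{i=k+1}^{n} \big\| (f_\theta(x_{i-1},\dots,x_{i-k}) - x_i) - (f_\theta(y_{i-1},\dots,y_{i-k}) - y_i) \big\| \]

where we used Assumption LipLoss(K) for the last inequality. So we have

\[ |h(x_1,\dots,x_n) - h(y_1,\dots,y_n)| \le \frac{1}{1+L} \sum_{i=k+1}^{n} \Big( \big\| f_\theta(x_{i-1},\dots,x_{i-k}) - f_\theta(y_{i-1},\dots,y_{i-k}) \big\| + \|x_i - y_i\| \Big) \]

\[ \le \frac{1}{1+L} \sum_{i=k+1}^{n} \Big( \sum_{j=1}^{k} a_j(\theta)\,\|x_{i-j} - y_{i-j}\| + \|x_i - y_i\| \Big) \le \frac{1}{1+L} \sum_{i=1}^{n} \Big( 1 + \sum_{j=1}^{k} a_j(\theta) \Big) \|x_i - y_i\| \le \sum_{i=1}^{n} \|x_i - y_i\| \]

where we used Assumption Lip(L). So we can apply Lemma 1 with $h(X_1,\dots,X_n) = \frac{n-k}{K(1+L)}\, r_n(\theta)$, $\mathbb{E}(h(X_1,\dots,X_n)) = \frac{n-k}{K(1+L)}\, R(\theta)$, and $t = K(1+L)\lambda/(n-k)$:

\[ \mathbb{E}\big(e^{\lambda[R(\theta) - r_n(\theta)]}\big) \le \exp\left(\frac{\lambda^2 K^2 (1+L)^2 (B + \theta_{\infty,n}(1))^2}{2n(1-k/n)^2}\right) \le \exp\left(\frac{\lambda^2 K^2 (1+L)^2 (B + C)^2}{2n(1-k/n)^2}\right) \]

by Assumption WeakDep(C). This ends the proof of the first inequality. The reverse inequality is obtained by replacing the function $h$ by $-h$. ■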

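We are now ready to state the following key Lemma.

Lemma 5. Let us assume that LowRates($\kappa$) is satisfied for some $\kappa > 0$. Then for any $\lambda > 0$ we have

\[ \mathbb{P}\left\{ \begin{array}{l} \forall \rho \in \mathcal{M}_+^1(\Theta), \\[2pt] \displaystyle \int R\,d\rho \le \int r_n\,d\rho + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{\mathcal{K}(\rho,\pi) + \log(2/\varepsilon)}{\lambda} \\[6pt] \text{and} \\[2pt] \displaystyle \int r_n\,d\rho \le \int R\,d\rho + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{\mathcal{K}(\rho,\pi) + \log(2/\varepsilon)}{\lambda} \end{array} \right\} \ge 1 - \varepsilon. \tag{B.3} \]

Proof of Lemma 5. Let us fix $\varepsilon > 0$ and $\lambda > 0$, and apply the first inequality of Lemma 4. We have:

\[ \mathbb{E} \exp\left( \lambda\Big( R(\theta) - r_n(\theta) - \frac{\lambda\kappa^2}{n(1-k/n)^2} \Big) \right) \le 1, \]

and we multiply this result by $\varepsilon/2$ and integrate it with respect to $\pi(d\theta)$. An application of Fubini's Theorem yields

\[ \mathbb{E} \int \exp\left( \lambda(R(\theta) - r_n(\theta)) - \frac{\lambda^2\kappa^2}{n(1-k/n)^2} - \log(2/\varepsilon) \right) \pi(d\theta) \le \frac{\varepsilon}{2}. \]

We apply Lemma 3 and we get:

\[ \mathbb{E} \exp\left( \sup_\rho \left\{ \lambda \int (R(\theta) - r_n(\theta))\,\rho(d\theta) - \frac{\lambda^2\kappa^2}{n(1-k/n)^2} - \log(2/\varepsilon) - \mathcal{K}(\rho,\pi) \right\} \right) \le \frac{\varepsilon}{2}. \]

As $e^x \ge \mathbf{1}_{\mathbb{R}_+}(x)$, we have:

\[ \mathbb{P}\left\{ \sup_\rho \left\{ \lambda \int (R(\theta) - r_n(\theta))\,\rho(d\theta) - \frac{\lambda^2\kappa^2}{n(1-k/n)^2} - \log(2/\varepsilon) - \mathcal{K}(\rho,\pi) \right\} \ge 0 \right\} \le \frac{\varepsilon}{2}. \]

Using the same arguments as above but starting with the second inequality of Lemma 4:

\[ \mathbb{E} \exp\left( \lambda\Big( r_n(\theta) - R(\theta) - \frac{\lambda\kappa^2}{n(1-k/n)^2} \Big) \right) \le 1, \]

we obtain:

\[ \mathbb{P}\left\{ \sup_\rho \left\{ \lambda \int [r_n(\theta) - R(\theta)]\,\rho(d\theta) - \frac{\lambda^2\kappa^2}{n(1-k/n)^2} - \log(2/\varepsilon) - \mathcal{K}(\rho,\pi) \right\} \ge 0 \right\} \le \frac{\varepsilon}{2}. \]

A union bound ends the proof.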
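The passage from the exponential moment bound to the probability bound, used twice in this proof, is the usual Markov-type argument. Writing $Z$ for the supremum inside the exponential (a shorthand introduced here only for this remark):

```latex
\mathbb{P}(Z \ge 0)
  = \mathbb{E}\big[\mathbf{1}_{\mathbb{R}_+}(Z)\big]
  \le \mathbb{E}\big[e^{Z}\big]
  \le \frac{\varepsilon}{2},
\qquad\text{since } e^{x} \ge \mathbf{1}_{\mathbb{R}_+}(x) \text{ for all } x \in \mathbb{R}.
```

Applying this to both one-sided bounds, the union bound gives a total failure probability of at most $\varepsilon/2 + \varepsilon/2 = \varepsilon$.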



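The following variant of Lemma 5 will also be useful.

Lemma 6. Let us assume that LowRates($\kappa$) is satisfied for some $\kappa > 0$. Then for any $\lambda > 0$ we have

\[ \mathbb{P}\left\{ \begin{array}{l} \forall \rho \in \mathcal{M}_+^1(\Theta), \\[2pt] \displaystyle \int R\,d\rho \le \int r_n\,d\rho + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{\mathcal{K}(\rho,\pi) + \log(2/\varepsilon)}{\lambda} \\[6pt] \text{and} \\[2pt] \displaystyle r_n(\bar\theta) \le R(\bar\theta) + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{\log(2/\varepsilon)}{\lambda} \end{array} \right\} \ge 1 - \varepsilon. \]

Proof of Lemma 6. Following the proof of Lemma 5 we have:

\[ \mathbb{P}\left\{ \sup_\rho \left\{ \lambda \int (R(\theta) - r_n(\theta))\,\rho(d\theta) - \frac{\lambda^2\kappa^2}{n(1-k/n)^2} - \log(2/\varepsilon) - \mathcal{K}(\rho,\pi) \right\} \ge 0 \right\} \le \frac{\varepsilon}{2}. \]

Now, we use the second inequality of Lemma 4, with $\theta = \bar\theta$:

\[ \mathbb{E}\left( \exp\left( \lambda\Big( r_n(\bar\theta) - R(\bar\theta) - \frac{\lambda\kappa^2}{n(1-k/n)^2} \Big) \right) \right) \le 1. \]

But then, we directly apply Markov's inequality to get:

\[ \mathbb{P}\left\{ r_n(\bar\theta) \ge R(\bar\theta) + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{\log(2/\varepsilon)}{\lambda} \right\} \le \frac{\varepsilon}{2}. \]

Here again, a union bound ends the proof. ■

B.3. Proof of Theorems 9 and 7

In this subsection we prove the general result on the Gibbs predictor.

Proof of Theorem 9. We apply Lemma 5. So, with probability at least $1 - \varepsilon$ we are on the event given by (B.3). From now, we work on that event. The first inequality of (B.3), when applied to $\hat\rho_\lambda(d\theta)$, gives

\[ \int R(\theta)\,\hat\rho_\lambda(d\theta) \le \int r_n(\theta)\,\hat\rho_\lambda(d\theta) + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{1}{\lambda}\log(2/\varepsilon) + \frac{1}{\lambda}\mathcal{K}(\hat\rho_\lambda,\pi). \]

According to Lemma 3 we have:

\[ \int r_n(\theta)\,\hat\rho_\lambda(d\theta) + \frac{1}{\lambda}\mathcal{K}(\hat\rho_\lambda,\pi) = \inf_\rho \left( \int r_n(\theta)\,\rho(d\theta) + \frac{1}{\lambda}\mathcal{K}(\rho,\pi) \right) \]

so we obtain

\[ \int R(\theta)\,\hat\rho_\lambda(d\theta) \le \inf_\rho \left\{ \int r_n(\theta)\,\rho(d\theta) + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{\mathcal{K}(\rho,\pi) + \log(2/\varepsilon)}{\lambda} \right\}. \tag{B.4} \]

We now estimate from above $r_n(\theta)$ by $R(\theta)$. Applying the second inequality of (B.3) and plugging it into Inequality (B.4) gives

\[ \int R(\theta)\,\hat\rho_\lambda(d\theta) \le \inf_\rho \left\{ \int R\,d\rho + \frac{2}{\lambda}\mathcal{K}(\rho,\pi) + \frac{2\lambda\kappa^2}{n(1-k/n)^2} + \frac{2}{\lambda}\log(2/\varepsilon) \right\}. \]

We end the proof by the remark that $\theta \mapsto R(\theta)$ is convex and so by Jensen's inequality $\int R(\theta)\,\hat\rho_\lambda(d\theta) \ge R\big(\int \theta\,\hat\rho_\lambda(d\theta)\big) = R(\hat\theta_\lambda)$. ■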
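Proof of Theorem 7. Let us apply Lemma 5 in each model $\Theta_j$, with a fixed $\lambda_j > 0$ and confidence level $\varepsilon_j > 0$. We obtain, for all $j$,

\[ \mathbb{P}\left\{ \begin{array}{l} \forall \rho \in \mathcal{M}_+^1(\Theta_j), \\[2pt] \displaystyle \int R\,d\rho \le \int r_n\,d\rho + \frac{\lambda_j\kappa_j^2}{n(1-k/n)^2} + \frac{\mathcal{K}(\rho,\pi_j) + \log(2/\varepsilon_j)}{\lambda_j} \\[6pt] \text{and} \\[2pt] \displaystyle \int r_n\,d\rho \le \int R\,d\rho + \frac{\lambda_j\kappa_j^2}{n(1-k/n)^2} + \frac{\mathcal{K}(\rho,\pi_j) + \log(2/\varepsilon_j)}{\lambda_j} \end{array} \right\} \ge 1 - \varepsilon_j. \]

We put $\varepsilon_j = p_j \varepsilon$; a union bound leads to:

\[ \mathbb{P}\left\{ \begin{array}{l} \forall j \in \{1, \dots, M\},\ \forall \rho \in \mathcal{M}_+^1(\Theta_j), \\[2pt] \displaystyle \int R\,d\rho \le \int r_n\,d\rho + \frac{\lambda_j\kappa_j^2}{n(1-k/n)^2} + \frac{\mathcal{K}(\rho,\pi_j) + \log\big(2/(\varepsilon p_j)\big)}{\lambda_j} \\[6pt] \text{and} \\[2pt] \displaystyle \int r_n\,d\rho \le \int R\,d\rho + \frac{\lambda_j\kappa_j^2}{n(1-k/n)^2} + \frac{\mathcal{K}(\rho,\pi_j) + \log\big(2/(\varepsilon p_j)\big)}{\lambda_j} \end{array} \right\} \ge 1 - \varepsilon. \tag{B.5} \]

From now, we only work on that event of probability at least $1 - \varepsilon$. Remark that

$R(\hat\theta) = R(\hat\theta_{\lambda_{\hat\jmath},\hat\jmath})$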
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< J R{Q)p x _ j(d8) by Jensen's inequality 

V'7 K{p x ^ 3 ) + \og(£) 



< / r nP 3 (d6)+ n , . . 2 ■ \ 
J n(l — fe/n) A 



by (B.5) 



= ^ f M y r ^ (dff)+ „(i-v») a 1 a 



by definition of j 

A . K 2 /C( p)7ri ) + l 0g (_|_ 



inf inf < / r n p(d0) -r 
i^AfpeM^Ce,) ]J n(l-k/ri) <\j 

by Lemma 3 

2Vf ^(ft^Q+log^) 



< inf inf < / i?p(d#) 
K^MpeMKe.) \J 



i<j<M peM^i&j) j 7 n(l-k/n) A,- 

by (B.5) again 

f ^— 2A 3 -«? ^log(-D,/<5)+log( i f-) 

< inf inf ^ i? 6> ■ + <S + + 2 

i<j<Mi>o 1 n (i _ k/nf Xj 

by restricting p as in the proof of Cor. 5 page 10 

f 2A,«| ^log(^)+log(i) 

< inf I MO A + - J —s + 2 * — 

i<i<Af n (i - k/nf Aj 



do 

by taking S = — 



i<i<M I a>o 1 n (i _ k/nf A 

by definition of Xj 

17. fD„e 2 nr^ lof 



< inf {R{6j) + 2- L—lJ-l lo 



i<j<M j 3 1 — fc/n j V n \ Kj y d, 



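B.4. Proof of Theorems 2 and 6

Let us now prove the results about the ERM.

Proof of Theorem 2. We choose $\pi$ as the uniform probability distribution on $\Theta$ and $\lambda > 0$. We apply Lemma 6. So we have, with probability at least $1 - \varepsilon$,

\[ \forall \rho \in \mathcal{M}_+^1(\Theta),\quad \int R\,d\rho \le \int r_n\,d\rho + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{\mathcal{K}(\rho,\pi) + \log(2/\varepsilon)}{\lambda} \quad\text{and}\quad r_n(\bar\theta) \le R(\bar\theta) + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{\log(2/\varepsilon)}{\lambda}. \]

We restrict the first inequality to Dirac masses $\rho \in \{\delta_\theta, \theta \in \Theta\}$ and we obtain:

\[ \forall \theta \in \Theta,\quad R(\theta) \le r_n(\theta) + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{\log(2M/\varepsilon)}{\lambda} \quad\text{and}\quad r_n(\bar\theta) \le R(\bar\theta) + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{\log(2/\varepsilon)}{\lambda}. \]

In particular, we apply the first inequality to $\hat\theta^{ERM}$. We recall that $\bar\theta$ minimizes $R$ on $\Theta$ and that $\hat\theta^{ERM}$ minimizes $r_n$ on $\Theta$, and so we have

\[ R(\hat\theta^{ERM}) \le r_n(\hat\theta^{ERM}) + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{\log(M) + \log(2/\varepsilon)}{\lambda} \]

\[ \le r_n(\bar\theta) + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{\log(M) + \log(2/\varepsilon)}{\lambda} \]

\[ \le R(\bar\theta) + \frac{2\lambda\kappa^2}{n(1-k/n)^2} + \frac{\log(M) + 2\log(2/\varepsilon)}{\lambda} \]

\[ \le R(\bar\theta) + \frac{2\lambda\kappa^2}{n(1-k/n)^2} + \frac{2\log(2M/\varepsilon)}{\lambda}. \]

The result still holds if we choose $\lambda$ as a minimizer of

\[ \frac{2\lambda\kappa^2}{n(1-k/n)^2} + \frac{2\log(2M/\varepsilon)}{\lambda}. \]

■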
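For reference, this minimizer can be computed explicitly. Writing the quantity as $a\lambda + b/\lambda$, with the shorthands $a$ and $b$ introduced here only for this computation:

```latex
a = \frac{2\kappa^2}{n(1-k/n)^2}, \qquad b = 2\log(2M/\varepsilon),
\qquad
\lambda^\ast = \sqrt{b/a} = \frac{(1-k/n)\sqrt{n\log(2M/\varepsilon)}}{\kappa},
\qquad
a\lambda^\ast + \frac{b}{\lambda^\ast} = 2\sqrt{ab}
  = \frac{4\kappa}{1-k/n}\sqrt{\frac{\log(2M/\varepsilon)}{n}} .
```

The remainder term is therefore of order $\sqrt{\log(M)/n}$ for this choice of $\lambda$.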

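Proof of Theorem 6. We put $\Theta' = \{\theta \in \mathbb{R}^d : \|\theta\|_1 \le D + 1\}$. We choose $\pi$ as the uniform probability distribution on $\Theta'$. We apply Lemma 6. So we have, with probability at least $1 - \varepsilon$,

\[ \forall \rho \in \mathcal{M}_+^1(\Theta'),\quad \int R\,d\rho \le \int r_n\,d\rho + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{\mathcal{K}(\rho,\pi) + \log(2/\varepsilon)}{\lambda} \quad\text{and}\quad r_n(\bar\theta) \le R(\bar\theta) + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{\log(2/\varepsilon)}{\lambda}. \]

So for any $\rho$,

\[ R(\hat\theta^{ERM}) = \int [R(\hat\theta^{ERM}) - R(\theta)]\,\rho(d\theta) + \int R\,d\rho \]

\[ \le \int [R(\hat\theta^{ERM}) - R(\theta)]\,\rho(d\theta) + \int r_n\,d\rho + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{\mathcal{K}(\rho,\pi) + \log(2/\varepsilon)}{\lambda} \]

\[ \le \int [R(\hat\theta^{ERM}) - R(\theta)]\,\rho(d\theta) + \int [r_n(\theta) - r_n(\hat\theta^{ERM})]\,\rho(d\theta) + r_n(\hat\theta^{ERM}) + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{\mathcal{K}(\rho,\pi) + \log(2/\varepsilon)}{\lambda} \]

\[ \le 2K\psi \int \|\theta - \hat\theta^{ERM}\|_1\,\rho(d\theta) + r_n(\bar\theta) + \frac{\lambda\kappa^2}{n(1-k/n)^2} + \frac{\mathcal{K}(\rho,\pi) + \log(2/\varepsilon)}{\lambda} \]

\[ \le 2K\psi \int \|\theta - \hat\theta^{ERM}\|_1\,\rho(d\theta) + R(\bar\theta) + \frac{2\lambda\kappa^2}{n(1-k/n)^2} + \frac{\mathcal{K}(\rho,\pi) + 2\log(2/\varepsilon)}{\lambda}. \]

Now we define, for any $\delta > 0$, $\rho_\delta$ by

\[ \frac{d\rho_\delta}{d\pi}(\theta) = \frac{\mathbf{1}\{\|\theta - \hat\theta^{ERM}\|_1 < \delta\}}{\int_{t\in\Theta'} \mathbf{1}\{\|t - \hat\theta^{ERM}\|_1 < \delta\}\,\pi(dt)}. \]

So in particular, we have, for any $\delta > 0$,

\[ R(\hat\theta^{ERM}) \le 2K\psi\delta + R(\bar\theta) + \frac{2\lambda\kappa^2}{n(1-k/n)^2} + \frac{-\log \int_{t\in\Theta'} \mathbf{1}\{\|t - \hat\theta^{ERM}\|_1 < \delta\}\,\pi(dt) + 2\log(2/\varepsilon)}{\lambda}. \]

But for any $\delta \le 1$,

\[ -\log \int_{t\in\Theta'} \mathbf{1}\{\|t - \hat\theta^{ERM}\|_1 < \delta\}\,\pi(dt) = d \log\left(\frac{D+1}{\delta}\right). \]

So we have

\[ R(\hat\theta^{ERM}) \le \inf_{\delta \le 1} \left\{ 2K\psi\delta + R(\bar\theta) + \frac{2\lambda\kappa^2}{n(1-k/n)^2} + \frac{d\log\big(\frac{D+1}{\delta}\big) + 2\log(2/\varepsilon)}{\lambda} \right\}. \]

We optimize this result by taking $\delta = d/(2\lambda K\psi)$, which is smaller than 1 as soon as $\lambda \ge d/(2K\psi)$; we get:

\[ R(\hat\theta^{ERM}) \le R(\bar\theta) + \frac{2\lambda\kappa^2}{n(1-k/n)^2} + \frac{d\log\big(\frac{2eK\psi(D+1)\lambda}{d}\big) + 2\log(2/\varepsilon)}{\lambda}. \]

We just choose $\lambda$ as the minimizer of the r.h.s., subject to $\lambda \ge d/(2K\psi)$, to end the proof. ■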

B.5. Some preliminary lemmas for the proof of Theorem 8 

Lemma 7. Under the hypothesis of Theorem 8, we have, for any $\theta \in \Theta$, for any $0 \le \lambda \le (n-k)/(2kKLBC)$,



E cxp ■ 

and 



E cxp < A 



{ A [ (l - ^) (m m) r(6) + r(5)] } 
{ A [ (l + ^) (R(e) R(9)) r{9) + r{0) 



< 1, 



< 1. 
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Lemma 7. We apply Lemma 2 to N = n — k, Zi = (-?Q+i, . . . , Xj + fc), 
1 



f(Zi) 



R{0) - R(9) 



- £ (Xi+k, fe(X i+ k-l, Xi+l)) + (■ {Xi+k, fg{Xi + k-l, • • • , X i+ i)) 

and so 

S N (f) = [R(9)-R(9)-r(9)+r(6)], 
and the Zi are uniformly mixing with coefficients cf>^ = 



Note that i 



Er=i VW = 1 + E"=i V^Jki <kC by PhiMix(C). For any 9 and & in 9 
let us put 

V(9, *)=e{ [^(X fc+1 , Xi)) - £(x fe+1 , /^(X fc , Xi)) 

We are going to apply Lemma 2. Remark that <J 2 (f) < V(9, 0)/(n — k) 2 . Also, 

I (Xi+k, fe(Xi+k-i,- ■ .,Xi+i)) - i (Xi+k, fg(Xi+k-i,- ■ .,Xi+i)) 

< ^ |/e(-X"i+fc-ij ■ • ■ ,X l+ i) - fj(X i+ k-i, ■ . ■ ,X i+1 )\ < KLB 

where we used LipLoss(if) for the first inequality and Lip(L) and PhiMix(S, C) 
for the second inequality This implies that ||/||oo < 2KLB/(n — k), so we can 
apply Lemma 2 for any < A < (n — k)/(2kKLBC)}, we have 



hi E exp 



< 



8kCV(d, 9)\ 2 



\[R(9) - R(9) - r(9)+r(9) 

Notice finally that Margin(/C) leads to 

V(6,6) = 1C[R(6)-R(9)] 

This proves the first inequality of Lemma 7. The second inequality is proved exactly in the same way, but replacing $f$ by $-f$. □

We are now ready to state the following key Lemma.

Lemma 8. Under the hypothesis of Theorem 8, we have, for any $0 \le \lambda \le (n-k)/(2kKLBC)$, for any $0 < \varepsilon < 1$,



' VpeMUB) 



8kC\ 
n—k 



(fRd P -R(9))<jrdp-r(9) 



K(p,7r)+log(2/e) 
A 



i - 

and 

J rdp - r(9) < (J Rdp - R(9)) (l + 



>>!-£. 
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Proof of Lemma 8. Let us fix e, A and 9 € 0, and apply the first inequality of 
Lemma 7. We have: 



E exp 



-^){R(0)-m)-r(e) + r(e) 



< i, 



and we multiply this result by e/2 and integrate it with respect to 7r(d0). Fubini's 
Theorem gives: 



E / cxp^ A 



8kCX 

n — k 



(R(9) - R(0)) - r{6) + r{6) + log(e/2) 



>7r(d6») 



< 



We apply Lemma 3 and we get: 
8kC\\ 

Eexp< sup A ' 

p 



1 -^i){J Rdp ~ m )-J rd p+ r ^ 

+ log( e /2)-/C(p,7r) 



< 



As e x > 1r + (x), we have: 



sup A 

. p 



1-^) ( 



+ log(e/2) 



-JC(p,7r)>0 <-. 



Let us apply the same arguments starting with the second inequality of Lemma 7. 
We obtain: 



sup A 

p 



1 + ^t) I 



+ log(e/2)-/C(p,7r) 



A union bound ends the proof. 



B.6. Proof of Theorem 8 

Proof of Theorem 8. Fix < A = (n - k)/(AkKLBC) A (n - fc)/(16fcC) < 
(n — fc) / '(2kK LBC) . Applying Lemma 8, we assume from now that the event of 
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probability at least 1 — e given by this lemma is satisfied. In particular we have 
VpeMl(e), 



J Rdp - R(0) < 



J rdp - r{9) + 



i K(p,7r)+log(2/e) 



1 - 



8kCX 
n—k 



In particular, thanks to Lemma 3, we have: 
J Rdp x - R(9) ; 



/ rdp - r(6) + K(p '" )+log(2/e) 



Now, we apply the second inequality of Lemma 8: 
J Rdp x - R{6) 

f l + 8kCx\ [J Rdp _ R (fi\ + 2 /C(p.,)+log(2/ E ) 

< inf '- -. 

n— k 



1 + ^) Ump . m+2 ^Msl 



< inf inf 

3 peM\(&j) C l _ 8fcCA 



< inf inf 

j 5>o 



by restricting $\rho$ as in the proof of Theorem 5. First, notice that our choice $\lambda \le (n-k)/(16kC)$ leads to



y Rdpx - R{6) < 2 inf inf i | + 5 - i?(6>)] + 2 



< 4 inf inf < R(6 j ) + 8-R{6) 



_ d,log(^)+log^ 



5 i e ft 



J 5>0 I A 

Taking 5 = dj/X leads to 

r f d > log f ^) + lo § f — ) 

/ i?d i 5 A - i?(0) < 4 inf I R(Q ) - R(8) + 

Finally, we replace the last occurrences of $\lambda$ by its value:
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J Rdp x - RQS) 

^log(^7 1 )+log( 



< 4inf < RIO.) - R{6) + (16kC V AkKLBC) — 

n — k 

Jensen's inequality leads to: 
R (§ x ) - R(6) 



d 

< 4inf { R(6j) - R(6) + AkC (4 V KLB) 



n — k 



